Tuesday, March 8, 2011

C++ Console Application to Get Comments from a Microsoft Word File

Output from Comment Extraction
The goal of this post is to show how to build a C++ console application that extracts comments from a Word document. It builds on a previous post, Getting Comments from a Microsoft Word File: Leveraging the OPC Format, which extracted comments from a Microsoft Word document (2007 or later) by changing the extension of the Word document and accessing the files directly in the ZIP structure. In this post, we take the Word document as is and use a console application written in C++/COM that uses the OPC API to access the comments directly. The code shown here was run in Visual Studio 2010 on Windows 7.

The key to the console application logic is understanding the document parts of the Word XML format. When we cracked open the Word ZIP file, we could get the comments file directly. With the API, we have to follow the pattern the API sets out. The pattern for a Word document is discussed here on MSDN and here. The main document part (../word/document.xml) is the main part in the package, and the comments part (../word/comments.xml) is linked to the main document part by a relationship that can be used to obtain the comments. On our first try, we kept trying to get the comments part directly from the package relationships, which didn't work. However, once we got the document part from the package (see the FindPartByRelationshipType function in the program below), we could use the same logic to get the comments part from the document part.
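The two-step lookup described above can be illustrated with a small portable model. This is only a sketch and does not use the OPC API: Relationship, Package, FindTarget and MakeSamplePackage are hypothetical names made up for the illustration, and the package is reduced to a map from a source (the empty string stands for the package root) to its outgoing relationships.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for one relationship: its type and the part it targets.
struct Relationship {
    std::string type;
    std::string target;  // already resolved to an absolute part name
};

// Hypothetical stand-in for a package: relationships keyed by their source.
// The empty string stands for the package root.
typedef std::map<std::string, std::vector<Relationship>> Package;

// Returns the target of the first relationship of the given type whose
// source is `source`, or an empty string when there is none.
std::string FindTarget(const Package &package, const std::string &source,
                       const std::string &type)
{
    assert(!type.empty());
    Package::const_iterator it = package.find(source);
    if (it == package.end()) return "";
    for (const Relationship &rel : it->second) {
        if (rel.type == type) return rel.target;
    }
    return "";
}

// Builds the two relationships discussed above for a typical Word file.
Package MakeSamplePackage()
{
    const std::string base =
        "http://schemas.openxmlformats.org/officeDocument/2006/relationships/";
    Package package;
    package[""].push_back(
        Relationship{base + "officeDocument", "/word/document.xml"});
    package["/word/document.xml"].push_back(
        Relationship{base + "comments", "/word/comments.xml"});
    return package;
}
```

In this model, asking the package root for a comments relationship finds nothing, which mirrors the failure we hit on our first try. Asking the root for the officeDocument relationship and then asking the resulting document part for the comments relationship finds the comments part.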

A crucial part of the console application is the set of definitions of content types and of relationship types between parts. These are defined in the header file (ExtractComments.h) for this application. For example, the content type of the comments part is:

application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml

The relationship type that links the document part to the comments part is:

http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments
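Inside the package, the relationships of a part are stored in a separate relationships part whose name follows the OPC naming convention: a _rels folder next to the source part, with .rels appended to the source file name. The OPC API hides this behind GetRelationshipSet, so the program never builds these names itself. As a rough illustration only, the convention can be sketched in portable C++ (RelationshipsPartName is a hypothetical helper, not part of the API):

```cpp
#include <cassert>
#include <string>

// Returns the name of the relationships part that holds the outgoing
// relationships of `sourcePartName`. Pass "/" for the package root.
// Hypothetical helper for illustration; not part of the OPC API.
std::string RelationshipsPartName(const std::string &sourcePartName)
{
    assert(!sourcePartName.empty() && sourcePartName[0] == '/');
    const std::string::size_type slash = sourcePartName.rfind('/');
    const std::string folder = sourcePartName.substr(0, slash + 1);  // e.g. "/word/"
    const std::string file = sourcePartName.substr(slash + 1);       // e.g. "document.xml"
    return folder + "_rels/" + file + ".rels";
}
```

With this sketch, the package root "/" maps to /_rels/.rels, where the officeDocument relationship lives, and /word/document.xml maps to /word/_rels/document.xml.rels, where the comments relationship lives.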

Note: In this console application we did not deal with the fact that comments in a Word document can contain more than just text. In the previous post we did deal with hyperlinks as an example of content besides text in comments. That handling would need to be added to this code. Specifically, if you look at ECMA-376 Part 1 for the docx format, you can find the details of what a comment can contain, which includes charts, diagrams, hyperlinks, images, video, and embedded content.

The code shown here was built starting from the samples provided with the OPC SDK Samples for Windows 7. In particular, we started from the SetAuthor project inside AllOPCSamples.zip and changed it to suit our purpose. The console application takes a file name as an argument. In Visual Studio, set the file name under the configuration properties of the project as shown below.

Visual Studio Console App Configuration

The code is shown below, along with links for downloading it. Before getting to the code, here is a sketch of its logic in pseudo-code. We use the syntax (x,y) -> z to mean that x and y are used to return z. It is a bit simplistic, but it helps clarify what is coming in and what is going out.

//pseudo-code
wmain
COM Initialization of Thread
CoCreateInstance of Opc Factory : () -> factory
Load Package : (factory, fileName) -> package
Find Document Part in Package : (package) -> documentPart
Find Comments Part in Package : (package, documentPart) -> commentsPart
Print Core Properties (package) -> output
Print Comments (commentsPart) -> output

Load Package
(factory, fileName) -> package
Create Stream on File : (factory, fileName, options) -> sourceFileStream
Read Package from Stream : (factory, sourceFileStream, options) -> package

Find Document In Package
(package) -> documentPart
relationshipType = http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument
contentType = application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml
Find Part by Relationship Type : (package, NULL, relationshipType, contentType) -> documentPart

Find Core Properties Part
(package) -> corePropertiesPart
relationshipType = http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties
contentType = application/vnd.openxmlformats-package.core-properties+xml
Find Part by Relationship Type : (package, NULL, relationshipType, contentType) -> corePropertiesPart

Find Comments in Package
(package, documentPart) -> commentsPart
relationshipType = http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments
contentType = application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml
Find Part by Relationship Type : (package, documentPart, relationshipType, contentType) -> commentsPart

Find Part By Relationship Type
(package, parentPart, relationshipType, contentType) -> part
Get Part Set : (package) -> partSet
Get Relationship Set
if (parentPart == NULL) then (package) -> packageRels
else (parentPart) -> packageRels
Get Enumerator for Type : (packageRels, relationshipType) -> packageRelsEnum
Get Current : (packageRelsEnum) -> currentRel
Resolve Target Uri to Part : (currentRel) -> partUri
Part Exists : (partSet, partUri) -> partExists
if (partExists) {
Get Current Part : (partSet, partUri) -> currentPart
Get Current Part Content Type : (currentPart) -> currentContentType
if (currentContentType equals contentType)
{ // found the part }
}

Resolve Target URI to Part
(relationship) -> resolvedUri

Print Comments
(commentsPart) -> output
Get DOM from Part : (commentsPart, namespace) -> commentsDom
Select Nodes : (commentsDom) -> commentsNodeList
for each {
Get Attributes of Comment Node
Get Text of Comment Node
}

Get Text of Comment Node
(node) -> output

Get Attributes of Comment Node
(node) -> output

Print Core Properties
(package) -> output
Find Core Properties : (package) -> corePropertiesPart
Get DOM from Part : (corePropertiesPart, namespace) -> corePropertiesDom
Select Single Node : (corePropertiesDom, nodeName) -> nodeFound
// work with nodeFound

Get DOM from Part
(part, namespace) -> XmlDocument
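
The "Resolve Target URI to Part" step above exists because a relationship usually stores its target relative to its source part. In the real program this is done by the OPC API (GetSourceUri, GetTargetUri and CombinePartUri, as shown in the listing further down). As a rough portable illustration of the idea only, here is a simplified sketch: ApplySegments and ResolveTarget are hypothetical helpers, and percent-encoding and part name validation are ignored.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Splits `path` on '/' and applies its segments to `stack`:
// empty and "." segments are skipped, ".." removes the last segment.
static void ApplySegments(const std::string &path, std::vector<std::string> &stack)
{
    std::string::size_type start = 0;
    while (start <= path.size()) {
        std::string::size_type end = path.find('/', start);
        if (end == std::string::npos) end = path.size();
        const std::string segment = path.substr(start, end - start);
        if (segment == "..") {
            if (!stack.empty()) stack.pop_back();
        } else if (!segment.empty() && segment != ".") {
            stack.push_back(segment);
        }
        start = end + 1;
    }
}

// Resolves a relationship target against the part name of its source.
// Use "/" as sourcePartName for package-level relationships.
std::string ResolveTarget(const std::string &sourcePartName, const std::string &target)
{
    assert(!sourcePartName.empty() && sourcePartName[0] == '/');
    std::vector<std::string> stack;
    if (target.empty() || target[0] != '/') {
        // Relative target: start from the folder that contains the source.
        const std::string folder =
            sourcePartName.substr(0, sourcePartName.rfind('/'));
        ApplySegments(folder, stack);
    }
    ApplySegments(target, stack);

    std::string resolved;
    for (const std::string &segment : stack) resolved += "/" + segment;
    return resolved;
}
```

For example, the target word/document.xml of a package-level relationship resolves against "/" to /word/document.xml, and the target comments.xml of a relationship whose source is /word/document.xml resolves to /word/comments.xml.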



The header file for the console application can be downloaded here and is shown below.
#include "msopc.h"
#include "msxml6.h"
#include "stdafx.h"

HRESULT LoadPackage(IOpcFactory *factory, LPCWSTR packageName, IOpcPackage **outPackage);
HRESULT FindDocumentInPackage(IOpcPackage *package, IOpcPart **documentPart);
HRESULT FindCommentsInPackage(IOpcPackage *package, IOpcPart *documentPart, IOpcPart **commentsPart);
HRESULT FindPartByRelationshipType(IOpcPackage *package, IOpcPart *parentPart, LPCWSTR relationshipType, LPCWSTR contentType, IOpcPart **part);
HRESULT ResolveTargetUriToPart(IOpcRelationship *relationship, IOpcPartUri **resolvedUri);
HRESULT PrintCoreProperties(IOpcPackage *package);
HRESULT PrintComments(IOpcPart *part);
HRESULT GetAttributesOfCommentNode(IXMLDOMNode *node);
HRESULT GetTextofCommentNode(IXMLDOMNode *node);
HRESULT FindCorePropertiesPart(IOpcPackage *package, IOpcPart **part);
HRESULT DOMFromPart(IOpcPart *part, LPCWSTR selectionNamespaces, IXMLDOMDocument2 **document);

static const WCHAR g_officeDocumentRelationshipType[] =
L"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument";
static const WCHAR g_wordProcessingContentType[] =
L"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml";
static const WCHAR g_corePropertiesRelationshipType[] =
L"http://schemas.openxmlformats.org/package/2006/relationships/metadata/core-properties";
static const WCHAR g_corePropertiesContentType[] =
L"application/vnd.openxmlformats-package.core-properties+xml";
static const WCHAR g_commentsRelationshipType[] =
L"http://schemas.openxmlformats.org/officeDocument/2006/relationships/comments";
static const WCHAR g_commentsContentType[] =
L"application/vnd.openxmlformats-officedocument.wordprocessingml.comments+xml";
static const WCHAR g_corePropertiesSelectionNamespaces[] =
L"xmlns:cp='http://schemas.openxmlformats.org/package/2006/metadata/core-properties' "
L"xmlns:dc='http://purl.org/dc/elements/1.1/' "
L"xmlns:dcterms='http://purl.org/dc/terms/' "
L"xmlns:dcmitype='http://purl.org/dc/dcmitype/' "
L"xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance'";
static const WCHAR g_commentsSelectionNamespaces[] =
L"xmlns:wpc='http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas' "
L"xmlns:mc='http://schemas.openxmlformats.org/markup-compatibility/2006' "
L"xmlns:o='urn:schemas-microsoft-com:office:office' "
L"xmlns:r='http://schemas.openxmlformats.org/officeDocument/2006/relationships' "
L"xmlns:m='http://schemas.openxmlformats.org/officeDocument/2006/math' "
L"xmlns:v='urn:schemas-microsoft-com:vml' "
L"xmlns:wp14='http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing' "
L"xmlns:wp='http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing' "
L"xmlns:w10='urn:schemas-microsoft-com:office:word' "
L"xmlns:w='http://schemas.openxmlformats.org/wordprocessingml/2006/main' "
L"xmlns:w14='http://schemas.microsoft.com/office/word/2010/wordml' "
L"xmlns:wpg='http://schemas.microsoft.com/office/word/2010/wordprocessingGroup' "
L"xmlns:wpi='http://schemas.microsoft.com/office/word/2010/wordprocessingInk' "
L"xmlns:wne='http://schemas.microsoft.com/office/word/2006/wordml' "
L"xmlns:wps='http://schemas.microsoft.com/office/word/2010/wordprocessingShape' ";


The main code file for the console application can be downloaded here and is shown below.
// ExtractComments.cpp : Defines the entry point for the console application.

#include "ExtractComments.h"
#include "stdio.h"
#include "windows.h"
#include "shlobj.h"
#include
#include "util.h"
using namespace std;

int wmain(int argc, wchar_t* argv[])
{
if (argc != 2)
{
wprintf(L"Usage: ExtractComments.exe <fileName>\n");
exit(0);
}
wprintf(L"Starting.\n");
LPCWSTR pFileName = argv[1];
HRESULT hr = CoInitializeEx(NULL, COINIT_MULTITHREADED);

if (SUCCEEDED(hr))
{
IOpcPackage * package = NULL;
IOpcPart * documentPart = NULL;
IOpcFactory * factory = NULL;
hr = CoCreateInstance(
__uuidof(OpcFactory),
NULL,
CLSCTX_INPROC_SERVER,
__uuidof(IOpcFactory),
(LPVOID*)&factory
);
if (SUCCEEDED(hr))
{
wprintf(L"Created factory.\n");
hr = ::LoadPackage(factory, pFileName, &package);
// See command arguments in project properties for specification of file to read.
}
if (SUCCEEDED(hr))
{
wprintf(L"Loaded package.\n");
hr = ::FindDocumentInPackage(package, &documentPart);

}
IOpcPart *commentsPart = NULL;
if (SUCCEEDED(hr))
{
wprintf(L"Found document in package.\n");
hr = ::FindCommentsInPackage(package, documentPart, &commentsPart);
}
if (SUCCEEDED(hr))
{
wprintf(L"Found comments in package.\n");
hr = ::PrintCoreProperties(package);
}
if (SUCCEEDED(hr))
{
wprintf(L"Found core properties in package.\n");
hr = ::PrintComments(commentsPart);
}
if (SUCCEEDED(hr))
{
wprintf(L"Found comments in package.\n");
}

// Release resources
if (package)
{
package->Release();
package = NULL;
}

if (documentPart)
{
documentPart->Release();
documentPart = NULL;
}

if (factory)
{
factory->Release();
factory = NULL;
}
CoUninitialize();
}
return 0;
}

HRESULT LoadPackage(
IOpcFactory *factory,
LPCWSTR packageName,
IOpcPackage **outPackage)
{
IStream * sourceFileStream = NULL;
HRESULT hr = factory->CreateStreamOnFile(
packageName,
OPC_STREAM_IO_READ,
NULL,
0,
&sourceFileStream);
if (SUCCEEDED(hr))
{
hr = factory->ReadPackageFromStream(
sourceFileStream,
OPC_CACHE_ON_ACCESS,
outPackage);
}
if (sourceFileStream)
{
sourceFileStream ->Release();
sourceFileStream = NULL;
}
return hr;
}
HRESULT FindDocumentInPackage(
IOpcPackage *package,
IOpcPart **documentPart)
{
return ::FindPartByRelationshipType(
package,
NULL,
g_officeDocumentRelationshipType,
g_wordProcessingContentType,
documentPart);

}
HRESULT FindCommentsInPackage(
IOpcPackage *package,
IOpcPart *documentPart,
IOpcPart **commentsPart)
{
return ::FindPartByRelationshipType(
package,
documentPart,
g_commentsRelationshipType,
g_commentsContentType,
commentsPart);

}
HRESULT FindCorePropertiesPart(
IOpcPackage * package,
IOpcPart **part)
{
return ::FindPartByRelationshipType(
package,
NULL,
g_corePropertiesRelationshipType,
g_corePropertiesContentType,
part);
}
HRESULT FindPartByRelationshipType(
IOpcPackage *package,
IOpcPart *parentPart,
LPCWSTR relationshipType,
LPCWSTR contentType,
IOpcPart **part)
{
*part = NULL;
IOpcRelationshipSet * packageRels = NULL;
IOpcRelationshipEnumerator * packageRelsEnum = NULL;
IOpcPartSet * partSet = NULL;
BOOL hasNext = false;

HRESULT hr = package->GetPartSet(&partSet);

if (SUCCEEDED(hr))
{
if (parentPart == NULL)
{
hr = package->GetRelationshipSet(&packageRels);
}
else
{
hr = parentPart->GetRelationshipSet(&packageRels);
}
}
if (SUCCEEDED(hr))
{
hr = packageRels->GetEnumeratorForType(
relationshipType,
&packageRelsEnum);
}
if (SUCCEEDED(hr))
{
hr = packageRelsEnum->MoveNext(&hasNext);
}
while (SUCCEEDED(hr) && hasNext && *part == NULL)
{
IOpcPartUri * partUri = NULL;
IOpcRelationship * currentRel = NULL;
BOOL partExists = FALSE;

hr = packageRelsEnum->GetCurrent(&currentRel);
if (SUCCEEDED(hr))
{
hr = ::ResolveTargetUriToPart(currentRel, &partUri);
}
if (SUCCEEDED(hr))
{
hr = partSet->PartExists(partUri, &partExists);
}
if (SUCCEEDED(hr) && partExists)
{
LPWSTR currentContentType = NULL;
IOpcPart * currentPart = NULL;
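hr = partSet->GetPart(partUri, &currentPart);
// Print the URI of the candidate part, then free what was allocated for it.
IOpcPartUri * name = NULL;
BSTR displayUri = NULL;
if (SUCCEEDED(hr) && SUCCEEDED(currentPart->GetName(&name)) &&
SUCCEEDED(name->GetDisplayUri(&displayUri)))
{
wprintf(L"currentPart: %s\n", displayUri);
}
SysFreeString(displayUri);
if (name) name->Release();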
if (SUCCEEDED(hr) && contentType != NULL)
{
hr = currentPart->GetContentType(&currentContentType);
wprintf(L"contentType: %s\n", currentContentType);
if (SUCCEEDED(hr) && 0 == wcscmp(contentType, currentContentType))
{
*part = currentPart; // found what we are looking for
currentPart = NULL;
}
}
if (SUCCEEDED(hr) && contentType == NULL)
{
*part = currentPart;
currentPart = NULL;
}
CoTaskMemFree(static_cast<LPVOID>(currentContentType));
if (currentPart)
{
currentPart->Release();
currentPart = NULL;
}
}
if (SUCCEEDED(hr))
{
hr = packageRelsEnum->MoveNext(&hasNext);
}
if (partUri)
{
partUri->Release();
partUri = NULL;
}

if (currentRel)
{
currentRel->Release();
currentRel = NULL;
}
}
if (SUCCEEDED(hr) && *part == NULL)
{
// Loop complete without errors and no part found.
hr = E_FAIL;
}

// Release resources
if (packageRels)
{
packageRels->Release();
packageRels = NULL;
}

if (packageRelsEnum)
{
packageRelsEnum->Release();
packageRelsEnum = NULL;
}

if (partSet)
{
partSet->Release();
partSet = NULL;
}
return hr;
}
HRESULT ResolveTargetUriToPart(
IOpcRelationship *relationship,
IOpcPartUri **resolvedUri
)
{
IOpcUri * sourceUri = NULL;
IUri * targetUri = NULL;
OPC_URI_TARGET_MODE targetMode;
HRESULT hr = relationship->GetTargetMode(&targetMode);
if (SUCCEEDED(hr) && targetMode != OPC_URI_TARGET_MODE_INTERNAL)
{
return E_FAIL;
}
if (SUCCEEDED(hr))
{
hr = relationship->GetTargetUri(&targetUri);
}
if (SUCCEEDED(hr))
{
hr = relationship->GetSourceUri(&sourceUri);
}
if (SUCCEEDED(hr))
{
hr = sourceUri->CombinePartUri(targetUri, resolvedUri);
}
if (sourceUri)
{
sourceUri->Release();
sourceUri = NULL;
}
if (targetUri)
{
targetUri->Release();
targetUri = NULL;
}
return hr;
}
HRESULT PrintComments(
IOpcPart *commentsPart)
{
IXMLDOMDocument2 * commentsDom = NULL;

HRESULT hr = ::DOMFromPart(
commentsPart,
g_commentsSelectionNamespaces,
&commentsDom);
if (SUCCEEDED(hr))
{
IXMLDOMNodeList * commentsNodeList = NULL;
BSTR text = NULL;
hr = commentsDom->selectNodes(
L"//w:comment",
&commentsNodeList);
if (SUCCEEDED(hr) && commentsNodeList != NULL)
{
// Iterate through comment nodes
// http://msdn.microsoft.com/en-us/library/ms757073(VS.85).aspx
long nodeListLength = 0;
hr = commentsNodeList->get_length(&nodeListLength);

for (long i = 0; SUCCEEDED(hr) && i < nodeListLength; i++)
{
IXMLDOMNode *item = NULL;
hr = commentsNodeList->get_item(i, &item);
if (SUCCEEDED(hr) && item)
{
::GetAttributesOfCommentNode(item);
::GetTextofCommentNode(item);
item->Release();
}
}

}
// Release resources
if (commentsNodeList)
{
commentsNodeList->Release();
commentsNodeList = NULL;
}
}
// Release resources
if (commentsPart)
{
commentsPart->Release();
commentsPart = NULL;
}

if (commentsDom)
{
commentsDom->Release();
commentsDom = NULL;
}

return hr;
}
HRESULT GetTextofCommentNode(
IXMLDOMNode *node
)
{
BSTR bstrQueryString1 = ::SysAllocString(L"w:p");
BSTR bstrQueryString2 = ::SysAllocString(L"w:r");
BSTR commentText = NULL;
IXMLDOMNodeList *resultList1 = NULL;
IXMLDOMNodeList *resultList2 = NULL;
IXMLDOMNode *pNode = NULL, *rNode = NULL;

long resultLength1 = 0, resultLength2 = 0;

HRESULT hr = node->selectNodes(bstrQueryString1, &resultList1);
if (SUCCEEDED(hr))
{
hr = resultList1->get_length(&resultLength1);
}
for (long i = 0; SUCCEEDED(hr) && i < resultLength1; i++)
{
hr = resultList1->get_item(i, &pNode);
if (SUCCEEDED(hr) && pNode)
{
// Each w:p (paragraph) of the comment starts on a new line.
wprintf(L"\n");
hr = pNode->selectNodes(bstrQueryString2, &resultList2);
if (SUCCEEDED(hr))
{
hr = resultList2->get_length(&resultLength2);
}
for (long j = 0; SUCCEEDED(hr) && j < resultLength2; j++)
{
hr = resultList2->get_item(j, &rNode);
if (SUCCEEDED(hr) && rNode)
{
// Print the text of each w:r (run) in the paragraph.
hr = rNode->get_text(&commentText);
if (SUCCEEDED(hr))
{
wprintf(L"%s", commentText);
}
::SysFreeString(commentText);
commentText = NULL;
rNode->Release();
rNode = NULL;
}
}
if (resultList2)
{
resultList2->Release();
resultList2 = NULL;
}
pNode->Release();
pNode = NULL;
}
}

// Release resources
::SysFreeString(bstrQueryString1); ::SysFreeString(bstrQueryString2);
bstrQueryString1 = NULL; bstrQueryString2 = NULL;
if (resultList1)
{
resultList1->Release();
resultList1 = NULL;
}
return hr;
}
HRESULT GetAttributesOfCommentNode(
IXMLDOMNode *node
)
{
VARIANT commentAuthorStr, commentDateStr;
::VariantInit(&commentAuthorStr);
::VariantInit(&commentDateStr);
BSTR bstrAttributeAuthor = ::SysAllocString(L"w:author");
BSTR bstrAttributeDate = ::SysAllocString(L"w:date");

// Get author and date attribute of the item.
//http://msdn.microsoft.com/en-us/library/ms767592(VS.85).aspx
IXMLDOMNamedNodeMap *attribs = NULL;
IXMLDOMNode *AttrNode = NULL;
HRESULT hr = node->get_attributes(&attribs);
if (SUCCEEDED(hr) && attribs)
{
hr = attribs->getNamedItem(bstrAttributeAuthor, &AttrNode);
if (SUCCEEDED(hr) && AttrNode)
{
AttrNode->get_nodeValue(&commentAuthorStr);
AttrNode->Release();
AttrNode = NULL;
}
hr = attribs->getNamedItem(bstrAttributeDate, &AttrNode);
if (SUCCEEDED(hr) && AttrNode)
{
AttrNode->get_nodeValue(&commentDateStr);
AttrNode->Release();
AttrNode = NULL;
}
attribs->Release();
attribs = NULL;
}

wprintf(L"\n-------------------------------------------------");
// Fall back to a placeholder when an attribute is missing.
wprintf(L"\nComment::\nAuthor: %s, Date: %s\n",
(commentAuthorStr.vt == VT_BSTR) ? commentAuthorStr.bstrVal : L"[missing]",
(commentDateStr.vt == VT_BSTR) ? commentDateStr.bstrVal : L"[missing]");

::VariantClear(&commentAuthorStr); ::VariantClear(&commentDateStr);
::SysFreeString(bstrAttributeAuthor); ::SysFreeString(bstrAttributeDate);
bstrAttributeAuthor = NULL; bstrAttributeDate = NULL;

return hr;
}
HRESULT PrintCoreProperties(
IOpcPackage *package)
{
IOpcPart * corePropertiesPart = NULL;
IXMLDOMDocument2 * corePropertiesDom = NULL;

HRESULT hr = ::FindCorePropertiesPart(
package,
&corePropertiesPart);
if (SUCCEEDED(hr))
{
hr = ::DOMFromPart(
corePropertiesPart,
g_corePropertiesSelectionNamespaces,
&corePropertiesDom);
}
if (SUCCEEDED(hr))
{
IXMLDOMNode * creatorNode = NULL;
BSTR text = NULL;
hr = corePropertiesDom->selectSingleNode(
L"//dc:creator",
&creatorNode);
if (SUCCEEDED(hr) && creatorNode != NULL)
{
hr = creatorNode->get_text(&text);
}
if (SUCCEEDED(hr))
{
wprintf(L"Author: %s\n", (text != NULL) ? text : L"[missing author info]");
}
// Release resources
if (creatorNode)
{
creatorNode->Release();
creatorNode = NULL;
}

SysFreeString(text);

// put other code here to read other properties
}
// Release resources
if (corePropertiesPart)
{
corePropertiesPart->Release();
corePropertiesPart = NULL;
}

if (corePropertiesDom)
{
corePropertiesDom->Release();
corePropertiesDom = NULL;
}
return hr;
}

HRESULT DOMFromPart(
IOpcPart * part,
LPCWSTR selectionNamespaces,
IXMLDOMDocument2 **document)
{
IXMLDOMDocument2 * partContentXmlDocument = NULL;
IStream * partContentStream = NULL;

HRESULT hr = CoCreateInstance(
__uuidof(DOMDocument60),
NULL,
CLSCTX_INPROC_SERVER,
__uuidof(IXMLDOMDocument2),
(LPVOID*)&partContentXmlDocument);
if (SUCCEEDED(hr) && selectionNamespaces)
{
AutoVariant v;
hr = v.SetBSTRValue(L"XPath");
if (SUCCEEDED(hr))
{
hr = partContentXmlDocument->setProperty(L"SelectionLanguage", v);
}
if (SUCCEEDED(hr))
{
AutoVariant v;
hr = v.SetBSTRValue(selectionNamespaces);
if (SUCCEEDED(hr))
{
hr = partContentXmlDocument->setProperty(L"SelectionNamespaces", v);
}
}
}
if (SUCCEEDED(hr))
{
hr = part->GetContentStream(&partContentStream);
}
if (SUCCEEDED(hr))
{
VARIANT_BOOL isSuccessful = VARIANT_FALSE;
AutoVariant vStream;
vStream.SetObjectValue(partContentStream);
hr = partContentXmlDocument->load(vStream, &isSuccessful);
if (SUCCEEDED(hr) && isSuccessful == VARIANT_FALSE)
{
hr = E_FAIL;
}
}
if (SUCCEEDED(hr))
{
*document = partContentXmlDocument;
partContentXmlDocument = NULL;
}
// Release resources
if (partContentXmlDocument)
{
partContentXmlDocument->Release();
partContentXmlDocument = NULL;
}

if (partContentStream)
{
partContentStream->Release();
partContentStream = NULL;
}
return hr;
}
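
Most of the listing above is bookkeeping: every interface pointer is released by hand on every path. On Windows, ATL's CComPtr is the usual way to automate this. As a rough idea of how such a wrapper works, here is a minimal portable sketch. ScopedRelease is a hypothetical class written for this illustration, and it only assumes that the wrapped type has a Release() method, as every COM interface does.

```cpp
#include <cassert>

// Owns one reference to a COM-style object and releases it on scope exit.
// Hypothetical illustration; T only needs a Release() method.
template <typename T>
class ScopedRelease
{
public:
    ScopedRelease() : ptr_(nullptr) {}
    ~ScopedRelease() { Reset(); }

    ScopedRelease(const ScopedRelease &) = delete;
    ScopedRelease &operator=(const ScopedRelease &) = delete;

    // For passing to out-parameters such as IOpcPart **part.
    T **Receive() { Reset(); return &ptr_; }
    T *Get() const { return ptr_; }
    T *operator->() const { assert(ptr_ != nullptr); return ptr_; }

    // Hands the reference to the caller without releasing it.
    T *Detach() { T *p = ptr_; ptr_ = nullptr; return p; }

    // Releases the held reference, if any.
    void Reset()
    {
        if (ptr_) { ptr_->Release(); ptr_ = nullptr; }
    }

private:
    T *ptr_;
};
```

With a wrapper like this, a call such as package->GetPartSet(partSet.Receive()) would need no matching release block at the end of the function, and Detach() would cover the case where a found part is handed back to the caller through an out-parameter.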
