The problem with searching inside the XML of a Word document is that the text can be split into separate elements (runs) at any character. It will certainly be split where the formatting changes, for example in "Hello World" with only "Hello" in bold. But it can be split at any other point too, and that is still valid OOXML. So you end up dealing with XML like this, and you can get the same kind of split even when the formatting does not change in the middle of a phrase!
<w:p w:rsidR="00C07F31" w:rsidRDefault="003F6D7A">
  <w:r w:rsidRPr="003F6D7A">
    <w:rPr>
      <w:b />
    </w:rPr>
    <w:t>Hello</w:t>
  </w:r>
  <w:r>
    <w:t xml:space="preserve">World.</w:t>
  </w:r>
</w:p>
You can, of course, load it into an XML DOM tree (I am not sure what the equivalent is in Python) and ask for the text only, as a single string. But you can still run into many other dead ends, simply because the OOXML spec is about 6000 pages long and MS Word can write many "things" that you do not expect. At that point you are effectively writing your own document processing library.
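As a rough illustration of the "load it into a tree and get the text only" approach, here is a minimal Python sketch using only the standard library. The function names (paragraph_texts, read_document_xml) and the SAMPLE string are made up for this example, and the sketch assumes the body text lives in word/document.xml inside the .docx archive. It only joins the w:t text of each paragraph; it ignores tabs, line breaks, fields, and nested paragraphs (for example in text boxes), which is exactly the kind of detail that makes a full solution hard.

```python
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML main namespace used by the w: prefix.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Illustrative sample (not from a real file): one paragraph split into two runs.
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:rPr><w:b/></w:rPr><w:t>Hello</w:t></w:r>'
    '<w:r><w:t xml:space="preserve"> World.</w:t></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)


def paragraph_texts(document_xml):
    """Return one plain string per w:p, with the text of all its runs joined."""
    root = ET.fromstring(document_xml)
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # Concatenate every w:t below this paragraph, in document order.
        pieces = [t.text or "" for t in p.iter("{%s}t" % W_NS)]
        paragraphs.append("".join(pieces))
    return paragraphs


def read_document_xml(docx_path):
    """Read the main document part from a .docx file (which is a zip archive)."""
    with zipfile.ZipFile(docx_path) as archive:
        return archive.read("word/document.xml")


if __name__ == "__main__":
    # A search in the raw XML misses the phrase; a search in the joined text finds it.
    print("Hello World" in SAMPLE)
    print(any("Hello World" in text for text in paragraph_texts(SAMPLE)))
```

For a real file you would call paragraph_texts(read_document_xml("some.docx")) and search the returned strings instead of the raw XML.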
Or you can try using Aspose.Words.
It is available as both a .NET and a Java product, and both can be used from Python: the .NET one through COM Interop, the Java one through JPype. See the Aspose.Words Programming Guide, "Utilize Aspose.Words in other programming languages" (sorry, I cannot post the second link, Stack Overflow does not allow me to yet).
romeok Nov 15 '09 at 11:01