How can I find a word in a Word 2007.docx file?

Question

How can I find a word in a Word 2007.docx file?

I would like to find a Word 2007 file (.docx) for a text string, for example, “some special phrase” that could / would be found when searching in Word.

Is there any way from Python to view the text? I am not interested in formatting - I just want to classify documents as having or not having “some special phrase”.

+43

python ms-word openxml docx

Gerry Sep 22 '08 at 17:08

source share

10 answers

After reading your post above, I created a 100% native Python docx module to solve this particular problem.

# Import the module from docx import * # Open the .docx file document = opendocx('A document.docx') # Search returns true if found search(document,'your search string')

The docx module is located at https://python-docx.readthedocs.org/en/latest/

+139

mikemaccana Dec 30 '09 at 12:08

source share

In this example, “Course Outline.docx” is a Word 2007 document that contains the word “Windows” and does not contain the phrase “another random line.”

 >>> import zipfile >>> z = zipfile.ZipFile("Course Outline.docx") >>> "Windows" in z.read("word/document.xml") True >>> "random other string" in z.read("word/document.xml") False >>> z.close()

Basically, you just open the docx file (which is a zip archive) with a zipfile and find the contents in the .xml 'document in the word folder. If you would like to be more complex, you could parse XML , but if you are just looking for a phrase (which you know, t is a tag), then you can just look in XML for the string.

+16

Tony Meyer Sep 22 '08 at 23:12

source share

The problem with searching inside the XML file of a Word document is that the text can be divided into elements of any character. It will be split if the formatting is different, for example in Hello World . But it can be shared at any point and is valid in OOXML. This way you get to deal with XML like this, even if the formatting does not change in the middle of a phrase!

 <w:pw:rsidR="00C07F31" w:rsidRDefault="003F6D7A"> <w:rw:rsidRPr="003F6D7A"> <w:rPr> <w:b /> </w:rPr> <w:t>Hello</w:t> </w:r> <w:r> <w:t xml:space="preserve">World.</w:t> </w:r> </w:p>

You can, of course, load it into the XML DOM tree (not sure if it will be in Python) and ask to get the text only as a string, but you can get many other dead ends just because the OOXML spec is about 6000 pages, and MS Word can write many “things” that you do not expect. This way you can write your own document processing library.

Or you can try using Aspose.Words .

It is available in both .NET and Java products. Both can be used from Python. One through COM Interop the other through JPype. See Aspose.Words Programming Guide, Utilize Aspose.Words in other programming languages (sorry, I cannot post the second link, stackoverflow still does not allow me).

+14

romeok Nov 15 '09 at 11:01

source share

Docx is just a zip archive with lots of files inside. Maybe you can look at some of these files? Other than that, you probably have to find a lib that understands the word format so that you can filter out what doesn't interest you.

A second choice would be to interact with the word and search through it.

+4

kokos Sep 22 '08 at 17:16

source share

You can use docx2txt to get the text inside docx than search in this txt

 npm install -g docx2txt docx2txt input.docx # This will print the text to stdout

+4

edi9999 Dec 09 '14 at 10:51

source share

A docx file is essentially a zip file with the xml inside it.
xml contains formatting but also contains text.

+2

shoosh Sep 22 '08 at 17:17

source share

Automating OLE is likely to be the easiest. You should consider formatting because the text may look like this in XML:

 <b>Looking <i>for</i> this <u>phrase</u>

There is no easy way to find this using a simple text scan.

+1

ilitirit Sep 22 '08 at 23:03

source share

You should be able to use the MSWord ActiveX interface to extract text for search (or possibly search). I have no idea how you access ActiveX from Python.

0

Andy Brice Sep 22 '08 at 17:17

source share

You can also use the library from OpenXMLDeveloper.org

0

billb Oct 18 '08 at 19:34

source share

PhiLho · Accepted Answer · 2008-09-22 17:22

More precisely, the .docx document is a Zip archive in the OpenXML format: first you need to unzip it.
I downloaded a sample (Google: some search term filetype: docx), and after unpacking I found several folders. The word folder contains the document itself in the document.xml file.

How can I find a word in a Word 2007.docx file?

More articles: