How to iterate over everything in a python-docx document?

Question

I am using python-docx to convert a Word docx to my own HTML equivalent. The document I need to convert has images and tables, but I could not figure out how to access the images and tables from within a run. Here is what I have in mind...

    for para in doc.paragraphs:
        for run in para.runs:
            # How to tell if this run has images or tables?

... but I don't see anything on Run that has information about an InlineShape or a Table. Should I drop down to the XML directly, or is there a better, cleaner way to iterate over everything in a document?

Thanks!

python python-docx

thebitguru Aug 05 '14 at 3:57

2 answers

Assuming doc is of type Document, what you want to do consists of three separate iterations:

One for paragraphs, as you have in your code
One for tables, via doc.tables
One for pictures, via doc.inline_shapes

The reason your code didn't work is that paragraphs don't hold references to the tables or shapes in the document. Those collections are stored on the Document object.

Here is the documentation for more information: python-docx

mleyfman Aug 05 '14 at 4:27

scanny · Accepted Answer · 2014-08-05T05:27:03+0000

There are actually two problems to solve for what you are trying to do. The first is iterating over all the block-level elements in a document, in document order. The second is iterating over all the inline elements within each block element, in the order they appear.

python-docx does not yet have the features you would need to do this directly. However, for the first problem, there is some example code here that is likely to work for you: https://github.com/python-openxml/python-docx/issues/40

I don't know of an exact example that deals with inline elements, but I expect you can get pretty far with paragraph.runs. All inline content will be within the paragraph. If you get most of the way there and just get hung up on getting images or something, you could go down to the lxml level and decode the XML to get what you need. If you get that far and are still keen, put a feature request in the GitHub issues list for something like "feature: Paragraph.iter_inline_items()" and I can probably provide you with some similar code to get what you need.

This requirement arises from time to time, so we will definitely want to add it at some point.

Note that block-level elements (paragraphs and tables, generally) can appear recursively, and a general solution will need to account for that. In particular, a paragraph can (and in fact at least one always must) appear in a table cell. A table can also appear in a table cell. So in theory the nesting can get quite deep. A recursive function or method is the right approach for reaching all of them.
