How to detect page breaks in a .docx file using python-docx

I have several .docx files, each containing more than 300 press releases of 1-2 pages, which must be split into separate text files. The only thing that consistently separates one article from the next is a page break.

However, I don't know how to find the page breaks when converting the containing Word documents to text, and the page break information is lost after conversion with my current script.

I want to know how to preserve hard page breaks when converting a .docx file to .txt. It doesn't matter how they look in the text file, as long as they can be uniquely identified when scanning the text file later.

Here is the script I use to convert docx files to txt:

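import io

from docx import opendocx, getdocumenttext  # legacy python-docx (0.2.x) API

def docx2txt(file_path):
    # opendocx returns the parsed document XML as an lxml element
    document = opendocx(file_path)
    # getdocumenttext yields one string per paragraph
    paratextlist = getdocumenttext(document)
    # strip the ".docx" suffix and write UTF-8 text, one blank line between paragraphs
    with io.open("%s.txt" % file_path[:-5], "w", encoding="utf-8") as text_file:
        text_file.write(u'\n\n'.join(paratextlist))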
1 answer

A hard page break appears as a <w:br> element inside a run (<w:r>) element, something like this:

<w:p>
  <w:r>
    <w:t>some text</w:t>
    <w:br w:type="page"/>
  </w:r>
</w:p>
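To make that structure concrete, here is a minimal sketch that finds such elements using only the standard library. The names SAMPLE and find_page_breaks are made up for this illustration, and the sample XML is a hand-written fragment, not output from a real document. A <w:br> without w:type="page" is treated as something other than a page break and is skipped.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace (the one the "w" prefix is bound to)
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical sample: one page break and one <w:br> with no type attribute
SAMPLE = (
    '<w:document xmlns:w="%s"><w:body>'
    '<w:p><w:r><w:t>some text</w:t><w:br w:type="page"/></w:r></w:p>'
    '<w:p><w:r><w:t>next article</w:t><w:br/></w:r></w:p>'
    '</w:body></w:document>' % W_NS
)

def find_page_breaks(xml_text):
    """Return every <w:br> element whose w:type attribute is "page"."""
    root = ET.fromstring(xml_text)
    type_attr = "{%s}type" % W_NS
    return [
        br for br in root.iter("{%s}br" % W_NS)
        if br.get(type_attr) == "page"
    ]

print(len(find_page_breaks(SAMPLE)))  # prints 1
```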

Thus, one approach would be to replace all of these occurrences with a distinctive line of text, for example, "{{foobar}}".

An implementation would look something like this:

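from lxml import etree
from docx import nsprefixes

# `document` is the lxml element returned by opendocx(file_path)
page_br_elements = document.xpath(
    "//w:p/w:r/w:br[@w:type='page']", namespaces={'w': nsprefixes['w']}
)
for br in page_br_elements:
    # build the tag in Clark notation ({namespace}t); lxml does not accept 'w:t' as a tag name
    t = etree.Element('{%s}t' % nsprefixes['w'], nsmap={'w': nsprefixes['w']})
    t.text = '{{foobar}}'
    # put the marker where the break was, then drop the break itself
    br.addprevious(t)
    br.getparent().remove(br)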

The namespace mapping comes from the docx module; everything else here is a method call on an lxml _Element.

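Once the marker is in the converted text, splitting it back into individual articles is plain string handling. The following is a rough sketch, assuming the marker is the "{{foobar}}" string used above and never occurs in the press releases themselves; split_articles and the sample text are invented for this example.

```python
MARKER = "{{foobar}}"

def split_articles(text, marker=MARKER):
    """Split converted text at each marker and drop empty pieces."""
    pieces = (piece.strip() for piece in text.split(marker))
    return [piece for piece in pieces if piece]

# Hypothetical converted text: two articles, each ending with the marker
sample = (
    "First release\n\nbody one{{foobar}}"
    "\n\nSecond release\n\nbody two{{foobar}}"
)

articles = split_articles(sample)
print(len(articles))  # prints 2
```

Each item in the resulting list can then be written to its own .txt file, for example by numbering the files in the order the articles appear.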


Source: https://habr.com/ru/post/1544512/

