How to access OpenXML content by page number?

Using OpenXML, can I read the contents of a document by page number?

wordDocument.MainDocumentPart.Document.Body provides the full text of the document.

    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";

        // Open a WordprocessingDocument based on a filepath.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;

            int pageCount = 0;
            if (wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text != null)
            {
                pageCount = Convert.ToInt32(wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text);
            }

            for (int i = 1; i <= pageCount; i++)
            {
                // Read the content by page number
            }
        }
    }


Update 1:

It looks like page breaks are marked as shown below:

    <w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
      <w:r>
        <w:br w:type="page" />
      </w:r>
    </w:p>

So now I need to split the XML at the marker above and take the InnerText of each part, which will give me the text page by page.

Now the question is: how can I split the XML at the marker above?


Update 2:

The page break element is only present where there is an explicit page break. If text simply flows from one page onto the next, there is no page break XML element, so it comes back to the same problem of identifying where the pages break.

+5
4 answers

Here is how I did it.

    // Requires: using System; using System.Collections.Generic; using System.Text;
    //           using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Wordprocessing;
    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        Dictionary<int, string> pagewiseContent = new Dictionary<int, string>();
        int pageCount = 0;

        // Open a WordprocessingDocument based on a filepath.
        using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;

            // The cached page count is optional, so guard against a missing part.
            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }

            int i = 1;
            StringBuilder pageContentBuilder = new StringBuilder();
            foreach (var element in body.ChildElements)
            {
                if (element.InnerXml.IndexOf("<w:br w:type=\"page\" />", StringComparison.OrdinalIgnoreCase) < 0)
                {
                    pageContentBuilder.Append(element.InnerText);
                }
                else
                {
                    // An element containing a hard page break closes the current page;
                    // its own text is skipped.
                    pagewiseContent.Add(i, pageContentBuilder.ToString());
                    i++;
                    pageContentBuilder = new StringBuilder();
                }

                // Flush whatever is left after the last element.
                if (body.LastChild == element && pageContentBuilder.Length > 0)
                {
                    pagewiseContent.Add(i, pageContentBuilder.ToString());
                }
            }
        }
    }

Downside: this does not work in all scenarios. It only works when the document contains explicit page breaks; if the text simply flows from page 1 onto page 2, there is no marker to tell you that you are on the second page.
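For what it's worth, the same idea can be sketched with the SDK's typed elements instead of string matching on InnerXml. This is only a minimal sketch, assuming the Open XML SDK (DocumentFormat.OpenXml) is referenced; the class and method names (HardPageBreakSplitter, SplitOnHardPageBreaks) are made up for illustration. It has the same limitation: only explicit <w:br w:type="page" /> elements are seen, never soft page breaks.

```csharp
using System.Collections.Generic;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper, not part of the Open XML SDK.
public static class HardPageBreakSplitter
{
    // Returns the body text split into chunks at explicit page breaks.
    // Layout-driven (soft) page breaks are not detected.
    public static List<string> SplitOnHardPageBreaks(Body body)
    {
        var pages = new List<string>();
        var current = new StringBuilder();

        // Descendants() walks the tree in document order, so text that sits
        // in the same paragraph as the break lands on the correct side of it.
        foreach (OpenXmlElement node in body.Descendants())
        {
            if (node is Break br && br.Type != null && br.Type.Value == BreakValues.Page)
            {
                pages.Add(current.ToString());
                current.Clear();
            }
            else if (node is Text text)
            {
                current.Append(text.Text);
            }
        }

        // Whatever follows the last break is the final chunk.
        pages.Add(current.ToString());
        return pages;
    }
}
```

Called as HardPageBreakSplitter.SplitOnHardPageBreaks(wordDocument.MainDocumentPart.Document.Body), it returns one string per chunk. Unlike the loop above, it does not drop the text of the paragraph that holds the break.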

+1

You cannot reference OOXML content by page number at the OOXML data level alone.

  • Hard page breaks are not the problem; hard page breaks can be counted.
  • Soft page breaks are the problem. They are computed by line-breaking and pagination algorithms, which are implementation-dependent; they are not intrinsic to the OOXML data. There is nothing to count.

What about w:lastRenderedPageBreak, which records the position of a soft page break at the time the document was last rendered? No, w:lastRenderedPageBreak does not help in general either, because it only reflects that last rendering: it is stale as soon as the content changes, and it is absent if the document was never rendered by an application that writes it.

If you are willing to accept a dependence on Word Automation, with all of its inherent licensing and server-use restrictions, then you have a chance of determining page boundaries, page numbers, page counts, etc.

Otherwise, the only real answer is to move away from page-based approaches, which depend on proprietary, implementation-specific pagination algorithms.

+4

Rename the .docx file to .zip and open the docProps\app.xml file:

    <?xml version="1.0" encoding="UTF-8" standalone="yes"?>
    <Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties"
                xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
      <Template>Normal</Template>
      <TotalTime>0</TotalTime>
      <Pages>1</Pages>
      <Words>141</Words>
      <Characters>809</Characters>
      <Application>Microsoft Office Word</Application>
      <DocSecurity>0</DocSecurity>
      <Lines>6</Lines>
      <Paragraphs>1</Paragraphs>
      <ScaleCrop>false</ScaleCrop>
      <HeadingPairs>
        <vt:vector size="2" baseType="variant">
          <vt:variant>
            <vt:lpstr></vt:lpstr>
          </vt:variant>
          <vt:variant>
            <vt:i4>1</vt:i4>
          </vt:variant>
        </vt:vector>
      </HeadingPairs>
      <TitlesOfParts>
        <vt:vector size="1" baseType="lpstr">
          <vt:lpstr/>
        </vt:vector>
      </TitlesOfParts>
      <Company/>
      <LinksUpToDate>false</LinksUpToDate>
      <CharactersWithSpaces>949</CharactersWithSpaces>
      <SharedDoc>false</SharedDoc>
      <HyperlinksChanged>false</HyperlinksChanged>
      <AppVersion>14.0000</AppVersion>
    </Properties>

The OpenXML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. These properties are written only by the Word application (winword). If the document has been changed since then, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is no longer valid. If the document was created programmatically, wordDocument.ExtendedFilePropertiesPart is often null.
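Given that the part may be missing, a defensive read might look like the sketch below. It assumes the Open XML SDK is referenced; CachedPageCount and TryRead are hypothetical names. It returns null when there is no usable <Pages> value rather than throwing, and the number it does return is still only the cached value from the last save, which may be out of date.

```csharp
using DocumentFormat.OpenXml.Packaging;

// Hypothetical helper, not part of the Open XML SDK.
public static class CachedPageCount
{
    // Returns the cached <Pages> value from docProps/app.xml,
    // or null if the part, the element, or a valid number is missing.
    public static int? TryRead(WordprocessingDocument doc)
    {
        var raw = doc.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
        return int.TryParse(raw, out int pages) ? pages : (int?)null;
    }
}
```

Usage would be along the lines of int pageCount = CachedPageCount.TryRead(wordDocument) ?? 0; inside the using block.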

0

    List<Paragraph> allParagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();

    // Paragraphs that contain exactly one lastRenderedPageBreak marker.
    List<Paragraph> pageParagraphs = allParagraphs.Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1).Select(x => x).Distinct().ToList();

-2

Source: https://habr.com/ru/post/1258071/