Convert OOXML formatting to a merged element

In OOXML, formatting such as bold and italics can be (and annoyingly often is) split across several elements, for example:

<w:p> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve">This is a </w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve">bold </w:t> </w:r> <w:r> <w:rPr> <w:b/> <w:i/> </w:rPr> <w:t>with a bit of italic</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve"> </w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>paragr</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>a</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>ph</w:t> </w:r> <w:r> <w:t xml:space="preserve"> with some non-bold in it too.</w:t> </w:r> </w:p> 

I need to combine these formatting elements to create this:

 <p><b>This is a mostly bold <i>with a bit of italic</i> paragraph</b> with some non-bold in it too.</p> 

My initial approach was to write the start formatting tag when it first occurs, using:

  <xsl:text disable-output-escaping="yes">&lt;b&gt;</xsl:text> 

Then, after processing each <w:r>, I check the following one to see whether the formatting is still present. If it is not, I add the end tag the same way I added the start tag. I keep thinking there must be a better way to do this, and I would be grateful for any suggestions.

I should also mention that I am tied to XSLT 1.0.

The reason for all this is that we need to compare the XML file before it is converted to OOXML with the file we get after converting back from OOXML. The extra formatting tags make it look as if changes were made when they were not.

+6
4 answers

Here is a complete XSLT 1.0 solution:

<xsl:stylesheet version="1.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:ext="http://exslt.org/common"
    xmlns:w="w"
    exclude-result-prefixes="ext w">
  <xsl:output omit-xml-declaration="yes"/>
  <xsl:strip-space elements="*"/>

  <!-- Pass 1: build an intermediate tree, then hand it to pass 2 -->
  <xsl:template match="w:p">
    <xsl:variable name="vrtfPass1">
      <p>
        <xsl:apply-templates/>
      </p>
    </xsl:variable>
    <xsl:apply-templates mode="pass2" select="ext:node-set($vrtfPass1)/*"/>
  </xsl:template>

  <!-- Sort the run properties by name so the nesting order is always the same -->
  <xsl:template match="w:r">
    <xsl:variable name="vrtfProps">
      <xsl:for-each select="w:rPr/*">
        <xsl:sort select="local-name()"/>
        <xsl:copy-of select="."/>
      </xsl:for-each>
    </xsl:variable>
    <xsl:call-template name="toHtml">
      <xsl:with-param name="pProps" select="ext:node-set($vrtfProps)/*"/>
      <xsl:with-param name="pText" select="w:t/text()"/>
    </xsl:call-template>
  </xsl:template>

  <!-- Wrap the text in one element per property, outermost first -->
  <xsl:template name="toHtml">
    <xsl:param name="pProps"/>
    <xsl:param name="pText"/>
    <xsl:choose>
      <xsl:when test="not($pProps)">
        <xsl:copy-of select="$pText"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:element name="{local-name($pProps[1])}">
          <xsl:call-template name="toHtml">
            <xsl:with-param name="pProps" select="$pProps[position()>1]"/>
            <xsl:with-param name="pText" select="$pText"/>
          </xsl:call-template>
        </xsl:element>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>

  <!-- Pass 2: merge batches of adjacent, identically named elements -->
  <xsl:template match="/*" mode="pass2">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:call-template name="processInner">
        <xsl:with-param name="pNodes" select="node()"/>
      </xsl:call-template>
    </xsl:copy>
  </xsl:template>

  <xsl:template name="processInner">
    <xsl:param name="pNodes"/>
    <xsl:variable name="pNode1" select="$pNodes[1]"/>
    <xsl:if test="$pNode1">
      <xsl:choose>
        <!-- Non-element node: copy it and continue with the rest -->
        <xsl:when test="not($pNode1/self::*)">
          <xsl:copy-of select="$pNode1"/>
          <xsl:call-template name="processInner">
            <xsl:with-param name="pNodes" select="$pNodes[position()>1]"/>
          </xsl:call-template>
        </xsl:when>
        <xsl:otherwise>
          <xsl:variable name="vbatchLength">
            <xsl:call-template name="getBatchLength">
              <xsl:with-param name="pNodes" select="$pNodes[position()>1]"/>
              <xsl:with-param name="pName" select="name($pNode1)"/>
              <xsl:with-param name="pCount" select="1"/>
            </xsl:call-template>
          </xsl:variable>
          <xsl:element name="{name($pNode1)}">
            <!-- Copy the attributes of the first node in the batch
                 (a bare @* would read them from the context node instead) -->
            <xsl:copy-of select="$pNode1/@*"/>
            <xsl:call-template name="processInner">
              <xsl:with-param name="pNodes" select="$pNodes[not(position()>$vbatchLength)]/node()"/>
            </xsl:call-template>
          </xsl:element>
          <xsl:call-template name="processInner">
            <xsl:with-param name="pNodes" select="$pNodes[position()>$vbatchLength]"/>
          </xsl:call-template>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:if>
  </xsl:template>

  <!-- Count how many consecutive elements named $pName start the batch -->
  <xsl:template name="getBatchLength">
    <xsl:param name="pNodes"/>
    <xsl:param name="pName"/>
    <xsl:param name="pCount"/>
    <xsl:choose>
      <xsl:when test="not($pNodes) or not($pNodes[1]/self::*) or not(name($pNodes[1])=$pName)">
        <xsl:value-of select="$pCount"/>
      </xsl:when>
      <xsl:otherwise>
        <xsl:call-template name="getBatchLength">
          <xsl:with-param name="pNodes" select="$pNodes[position()>1]"/>
          <xsl:with-param name="pName" select="$pName"/>
          <xsl:with-param name="pCount" select="$pCount+1"/>
        </xsl:call-template>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
</xsl:stylesheet>

When this transformation is applied to the following XML document (based on the provided one, but made more complex so that it covers more edge cases):

 <w:p xmlns:w="w"> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve">This is a </w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve">bold </w:t> </w:r> <w:r> <w:rPr> <w:b/> <w:i/> </w:rPr> <w:t>with a bit of italic</w:t> </w:r> <w:r> <w:rPr> <w:b/> <w:i/> </w:rPr> <w:t> and some more italic</w:t> </w:r> <w:r> <w:rPr> <w:i/> </w:rPr> <w:t> and just italic, no-bold</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve"></w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>paragr</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>a</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>ph</w:t> </w:r> <w:r> <w:t xml:space="preserve"> with some non-bold in it too.</w:t> </w:r> </w:p> 

the wanted, correct result is produced:

 <p><b>This is a bold <i>with a bit of italic and some more italic</i></b><i> and just italic, no-bold</i><b>paragraph</b> with some non-bold in it too.</p> 

Explanation

  1. This is a two-pass transformation. The first pass is relatively simple and converts the source XML document (in our particular case) into the following:

     Pass 1 result (indented for readability):

     <p>
       <b>This is a </b>
       <b>bold </b>
       <b>
         <i>with a bit of italic</i>
       </b>
       <b>
         <i> and some more italic</i>
       </b>
       <i> and just italic, no-bold</i>
       <b/>
       <b>paragr</b>
       <b>a</b>
       <b>ph</b> with some non-bold in it too.</p>

  2. The second pass (performed in "pass2" mode) merges any batch of consecutive, identically named elements into a single element with that name. It calls itself recursively on the children of the merged elements, so batches are merged at any depth.

  3. Note: we do not (and cannot) use the following-sibling:: or preceding-sibling:: axes, because only the top-level nodes to be merged are really siblings. For this reason we process all nodes simply as a node-set.

  4. This solution is completely general: it merges any sequence of consecutive, identically named elements at any depth, and no specific element names are hardcoded.
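For readers who find the recursion easier to follow outside XSLT, the two passes can be sketched in Python. This is only an illustration of the idea, not a port of the stylesheet: the function names (wrap, merge, render) and the tuple-based node representation are assumptions made for this sketch.

```python
from itertools import groupby

def wrap(props, text):
    """Pass 1: nest the text in one element per property, sorted by name."""
    node = text
    for name in sorted(props, reverse=True):
        node = (name, [node])  # an element is a (name, children) tuple
    return node

def merge(nodes):
    """Pass 2: merge each batch of adjacent same-named elements, recursively."""
    out = []
    key = lambda n: n[0] if isinstance(n, tuple) else None
    for name, batch in groupby(nodes, key=key):
        batch = list(batch)
        if name is None:
            out.extend(batch)  # text nodes are copied through unchanged
        else:
            # Pool the children of the whole batch, then merge them in turn
            children = [child for _, kids in batch for child in kids]
            out.append((name, merge(children)))
    return out

def render(nodes):
    """Serialize the nested nodes back to markup."""
    return "".join(
        n if isinstance(n, str) else f"<{n[0]}>{render(n[1])}</{n[0]}>"
        for n in nodes
    )

# The runs of the sample document above, as (properties, text) pairs
runs = [
    ({"b"}, "This is a "),
    ({"b"}, "bold "),
    ({"b", "i"}, "with a bit of italic"),
    ({"b", "i"}, " and some more italic"),
    ({"i"}, " and just italic, no-bold"),
    ({"b"}, ""),
    ({"b"}, "paragr"),
    ({"b"}, "a"),
    ({"b"}, "ph"),
    (set(), " with some non-bold in it too."),
]

html = "<p>" + render(merge([wrap(p, t) for p, t in runs])) + "</p>"
print(html)
```

As in the stylesheet, a batch ends at the first node that is not an element with the same name, and the merged children are processed again, which is what makes the merge work at any depth.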

+6

This is not a complete solution, but much easier than trying to do this with pure XSLT. Depending on the complexity of your source, this may not be ideal, but it may be worth a try. These templates:

 <xsl:template match="w:p"> <p> <xsl:apply-templates /> </p> </xsl:template> <xsl:template match="w:r[w:rPr/w:b]"> <b> <xsl:apply-templates /> </b> </xsl:template> <xsl:template match="w:r[w:rPr/w:i]"> <i> <xsl:apply-templates /> </i> </xsl:template> <xsl:template match="w:r[w:rPr/w:i and w:rPr/w:b]"> <b> <i> <xsl:apply-templates /> </i> </b> </xsl:template> 

produce the following output:

<p><b>This is a </b><b>bold </b><b><i>with a bit of italic</i></b><b> </b><b>paragr</b><b>a</b><b>ph</b> with some non-bold in it too.</p>

You can then use simple text manipulation to remove any occurrences of </b><b> and </i><i>, leaving you with:

<p><b>This is a bold <i>with a bit of italic</i> paragraph</b> with some non-bold in it too.</p>
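The text-manipulation step could look something like the following Python sketch. The function name collapse_adjacent_tags is hypothetical, and the approach assumes the generated tags never carry attributes, which holds for the templates above but not for HTML in general.

```python
def collapse_adjacent_tags(html, tags=("b", "i")):
    """Delete every '</x><x>' pair for the given tag names until none remain."""
    while True:
        collapsed = html
        for tag in tags:
            collapsed = collapsed.replace(f"</{tag}><{tag}>", "")
        if collapsed == html:
            return html  # nothing left to collapse
        html = collapsed

raw = ("<p><b>This is a </b><b>bold </b><b><i>with a bit of italic</i></b>"
       "<b> </b><b>paragr</b><b>a</b><b>ph</b> with some non-bold in it too.</p>")
print(collapse_adjacent_tags(raw))
```

The loop repeats until the string stops changing because removing one pair can expose another: deleting </b><b> from </i></b><b><i> leaves </i><i>, which then also has to go.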

+3

OOXML is a well-defined standard with its own specification. To build a general transformation from OOXML to HTML (an interesting exercise, although I would expect existing implementations to be available online), you should study the specification at least a little (and you probably need to learn some XSLT as well).

In general, the content of a WordML document consists mainly of w:p (paragraph) elements containing w:r elements (runs, that is, regions of text that share the same properties). Inside each run you will usually find the text properties of the region (w:rPr) and the text itself (w:t).

The model is much more complex, but you can start working on this general structure.
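To make that structure concrete, here is a small Python sketch that walks a paragraph and lists the properties and text of each run. It uses the standard xml.etree.ElementTree module; the helper name iter_runs and the shortened sample paragraph are assumptions for illustration, and the namespace URI is the one used in the sample document further down.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.microsoft.com/office/word/2003/wordml"
NS = {"w": W}

def iter_runs(paragraph):
    """Yield (sorted property names, text) for each w:r child of a w:p element."""
    for run in paragraph.findall("w:r", NS):
        # Tags look like '{namespace}b'; keep only the local name
        props = sorted(p.tag.split("}")[1] for p in run.findall("w:rPr/*", NS))
        text = "".join(t.text or "" for t in run.findall("w:t", NS))
        yield props, text

sample = f"""<w:p xmlns:w="{W}">
  <w:r><w:rPr><w:b/></w:rPr><w:t xml:space="preserve">This is a </w:t></w:r>
  <w:r><w:rPr><w:b/><w:i/></w:rPr><w:t>with a bit of italic</w:t></w:r>
  <w:r><w:t xml:space="preserve"> with some non-bold in it too.</w:t></w:r>
</w:p>"""

for props, text in iter_runs(ET.fromstring(sample)):
    print(props, repr(text))
# ['b'] 'This is a '
# ['b', 'i'] 'with a bit of italic'
# [] ' with some non-bold in it too.'
```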

For example, you can start from the following (somewhat) general transformation. Note that it only handles paragraphs with bold, italic, and underlined text.

XSLT 2.0, tested with Saxon-HE 9.2.1.1J:

 <xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform" xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml" exclude-result-prefixes="w"> <xsl:output method="html"/> <xsl:strip-space elements="*"/> <xsl:template match="w:document/w:body"> <html> <body> <xsl:apply-templates select="w:p"/> </body> </html> </xsl:template> <!-- match paragraph --> <xsl:template match="w:p"> <p> <xsl:apply-templates select="w:r"/> </p> </xsl:template> <!-- match run with property --> <xsl:template match="w:r[w:rPr]"> <xsl:apply-templates select="w:rPr/*[1]"/> </xsl:template> <!-- Recursive template for bold, italic and underline properties applied to the same run. Escape to paragraph text --> <xsl:template match="w:b | w:i | w:u"> <xsl:element name="{local-name(.)}"> <xsl:choose> <!-- recurse to next sibling property i, b or u --> <xsl:when test="count(following-sibling::*[1])=1"> <xsl:apply-templates select="following-sibling::* [local-name(.)='i' or local-name(.)='b' or local-name(.)='u']"/> </xsl:when> <xsl:otherwise> <!-- escape to text --> <xsl:apply-templates select="parent::w:rPr/ following-sibling::w:t"/> </xsl:otherwise> </xsl:choose> </xsl:element> </xsl:template> <!-- match run without property --> <xsl:template match="w:r[not(w:rPr)]"> <xsl:apply-templates select="w:t"/> </xsl:template> <!-- match text --> <xsl:template match="w:t"> <xsl:value-of select="."/> </xsl:template> </xsl:stylesheet> 

Applied to:

 <w:document xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"> <w:body> <w:p> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve">This is a </w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve">bold </w:t> </w:r> <w:r> <w:rPr> <w:b/> <w:i/> </w:rPr> <w:t>with a bit of italic</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t xml:space="preserve"> </w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>paragr</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>a</w:t> </w:r> <w:r> <w:rPr> <w:b/> </w:rPr> <w:t>ph</w:t> </w:r> <w:r> <w:t xml:space="preserve"> with some non-bold in it too.</w:t> </w:r> </w:p> </w:body> </w:document> 

gives:

 <html> <body> <p><b>This is a </b><b>bold </b><b><i>with a bit of italic</i></b><b> </b><b>paragr</b><b>a</b><b>ph</b> with some non-bold in it too. </p> </body> </html> 

The ugly HTML is an unavoidable side effect of the underlying WordML schema. Perhaps the task of making the final HTML readable can be deferred to a user-friendly (and powerful) utility such as HTML Tidy.

+3

Another approach, similar to Flynn's but staying within XSLT instead of adding a separate text-processing layer, would be to feed the raw HTML output back through the same stylesheet in order to collapse adjacent <b> or <i> elements into single elements.

In other words, the stylesheet first generates the raw HTML result tree and then passes it as input to a set of templates (in a special mode) that perform the collapse operation.
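The collapse step on an already-built tree can be sketched in Python with xml.etree.ElementTree. This is an illustration of the idea under stated assumptions, not a translation of any stylesheet in this thread: the function name collapse and the COLLAPSIBLE set are made up for the sketch, and two siblings count as adjacent only when no text at all sits between them.

```python
import xml.etree.ElementTree as ET

COLLAPSIBLE = {"b", "i", "u"}

def collapse(parent):
    """Merge adjacent same-tag collapsible children of parent, then recurse."""
    prev = None
    for child in list(parent):  # iterate over a copy, since children get removed
        if (prev is not None and child.tag == prev.tag
                and child.tag in COLLAPSIBLE and not prev.tail):
            # Append the child's leading text after prev's last piece of content
            if len(prev):
                prev[-1].tail = (prev[-1].tail or "") + (child.text or "")
            else:
                prev.text = (prev.text or "") + (child.text or "")
            prev.extend(list(child))  # move the child's own children over
            prev.tail = child.tail    # text after the child now follows prev
            parent.remove(child)
        else:
            prev = child
    for child in parent:
        collapse(child)

raw = ("<p><b>This is a </b><b>bold </b><b><i>with a bit of italic</i></b>"
       "<b> </b><b>paragr</b><b>a</b><b>ph</b> with some non-bold in it too.</p>")
root = ET.fromstring(raw)
collapse(root)
print(ET.tostring(root, encoding="unicode"))
```

Because the function recurses into every child after merging, runs that end up next to each other inside a merged element (for example two <i> elements inside one <b>) are merged as well.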

Update: Here is a working two-pass stylesheet, built on @empo's stylesheet as the first pass:

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    exclude-result-prefixes="xs w"
    xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml"
    version="2.0">
  <xsl:output method="html"/>
  <xsl:strip-space elements="*"/>
  <xsl:variable name="collapsibles" select="('i', 'b', 'u')"/>

  <!-- identity template, except we collapse any adjacent b or i child elements. -->
  <xsl:template match="*" mode="collapse-adjacent">
    <xsl:copy>
      <xsl:copy-of select="@*"/>
      <xsl:for-each select="node()">
        <xsl:choose>
          <xsl:when test="index-of($collapsibles, local-name()) and not(name(preceding-sibling::node()[1]) = name())">
            <xsl:copy>
              <xsl:copy-of select="@*"/>
              <xsl:call-template name="process-niblings"/>
            </xsl:copy>
          </xsl:when>
          <xsl:when test="index-of($collapsibles, local-name())"/> <!-- do not copy -->
          <xsl:otherwise>
            <xsl:copy>
              <xsl:copy-of select="@*"/>
              <xsl:apply-templates mode="collapse-adjacent"/>
            </xsl:copy>
          </xsl:otherwise>
        </xsl:choose>
      </xsl:for-each>
    </xsl:copy>
  </xsl:template>

  <!-- apply templates to children of current element *and* of all
       consecutively following elements of the same name. -->
  <xsl:template name="process-niblings">
    <xsl:apply-templates mode="collapse-adjacent"/>
    <!-- If immediate following sibling is the same element type, recurse
         with context node set to that sibling. -->
    <xsl:for-each select="following-sibling::node()[1][name() = name(current())]">
      <xsl:call-template name="process-niblings"/>
    </xsl:for-each>
  </xsl:template>

  <!-- @empo stylesheet (modified) follows. -->
  <xsl:template match="/">
    <html>
      <body>
        <xsl:variable name="raw-html">
          <xsl:apply-templates/>
        </xsl:variable>
        <xsl:apply-templates select="$raw-html" mode="collapse-adjacent"/>
      </body>
    </html>
  </xsl:template>

  <xsl:template match="w:document | w:body">
    <xsl:apply-templates/>
  </xsl:template>

  <!-- match paragraph -->
  <xsl:template match="w:p">
    <p>
      <xsl:apply-templates select="w:r"/>
    </p>
  </xsl:template>

  <!-- match run with property -->
  <xsl:template match="w:r[w:rPr]">
    <xsl:apply-templates select="w:rPr/*[1]"/>
  </xsl:template>

  <!-- Recursive template for bold, italic and underline properties
       applied to the same run. Escape to paragraph text -->
  <xsl:template match="w:b | w:i | w:u">
    <xsl:element name="{local-name(.)}">
      <xsl:choose>
        <!-- recurse to next sibling property i, b or u -->
        <xsl:when test="count(following-sibling::*[1])=1">
          <xsl:apply-templates select="following-sibling::*[index-of($collapsibles, local-name(.))]"/>
        </xsl:when>
        <xsl:otherwise>
          <!-- escape to text -->
          <xsl:apply-templates select="parent::w:rPr/following-sibling::w:t"/>
        </xsl:otherwise>
      </xsl:choose>
    </xsl:element>
  </xsl:template>

  <!-- match run without property -->
  <xsl:template match="w:r[not(w:rPr)]">
    <xsl:apply-templates select="w:t"/>
  </xsl:template>

  <!-- match text -->
  <xsl:template match="w:t">
    <xsl:value-of select="."/>
  </xsl:template>
</xsl:stylesheet>

When run against the sample input you provided, the above stylesheet gives:

 <html> <body> <p><b>This is a bold <i>with a bit of italic</i> paragraph</b> with some non-bold in it too. </p> </body> </html> 

which looks like what you want.

+3

Source: https://habr.com/ru/post/890175/

