Convert HTML with XQuery

I want to take the HTML generated by the QTextEdit editor and convert it into something friendlier to use on a real web page. Unfortunately, the HTML generator that is part of the QTextEdit API is not public and cannot be modified. I would rather not write my own WYSIWYG HTML editor when QTextEdit already gives me most of what I need.

In a short discussion on the qt-interest mailing list, someone suggested using XQuery through the QtXmlPatterns module.

The HTML the editor produces is ugly: it uses <span style=" font-weight:600"> for bold text, <span style=" font-weight:600; text-decoration: underline"> for bold underlined text, and so on. Here is an example:

    <html>
    <head>
    </head>
    <body style=" font-family:'Lucida Grande'; font-size:14pt; font-weight:400; font-style:normal;">
    <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text</p>
    <p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"></p>
    <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" font-weight:600;">bold text</span></p>
    <p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-weight:600;"></p>
    <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" font-style:italic;">italics text</span></p>
    <p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-style:italic;"></p>
    <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" text-decoration: underline;">underline text</span></p>
    <p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"></p>
    <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" font-weight:600; text-decoration: underline;">bold underline text</span></p>
    <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span style=" font-weight:600;">bold text </span><span style=" font-weight:600; text-decoration: underline;">bold underline text</span></p>
    </body>
    </html>

I would like to convert it to something like this:

    <body>
        <p>plain text</p>
        <p/>
        <p>plain text <b>bold text</b></p>
        <p/>
        <p>plain text <em>italics text</em></p>
        <p/>
        <p>plain text <u>underline text</u></p>
        <p/>
        <p>plain text <b>bold text <u>bold underline text</u></b></p>
    </body>

I have gotten about 90% of the way there. I can correctly convert the first four cases, where each <span> style carries only one of the italic, bold, or underline properties. My problem is with spans whose style has several properties at once, for example both font-weight:600 and text-decoration: underline.

Here is the XQuery code I have so far:

    declare function local:process_span_data($node as node())
    {
        for $n in $node
        return (
            for $attr in $n/@style
            return (
                if (contains($attr, 'font-weight:600')) then (
                    <b>{data($n)}</b>
                )
                else if (contains($attr, 'text-decoration: underline')) then (
                    <u>{data($n)}</u>
                )
                else if (contains($attr, 'font-style:italic')) then (
                    <em>{data($n)}</em>
                )
                else (
                    data($n)
                )
            )
        )
    };

    declare function local:process_p_data($data as node()+)
    {
        for $d in $data
        return (
            if ($d instance of text()) then $d
            else local:process_span_data($d)
        )
    };

    let $doc := doc('myfile.html')
    for $body in $doc/html/body
    return
        <body>
        {
            for $p in $body/p
            return (
                if (contains($p/@style, '-qt-paragraph-type:empty;')) then (
                    <p />
                )
                else (
                    if (count($p/*) = 0) then (
                        <p>{data($p)}</p>
                    )
                    else (
                        <p>
                        {for $data in $p/node()
                         return local:process_p_data($data)}
                        </p>
                    )
                )
            )
        }</body>

This gives ALMOST the correct result:

    <body>
        <p>plain text</p>
        <p/>
        <p>plain text <b>bold text</b>
        </p>
        <p/>
        <p>plain text <em>italics text</em>
        </p>
        <p/>
        <p>plain text <u>underline text</u>
        </p>
        <p/>
        <p>plain text <b>bold underline text</b>
        </p>
        <p>plain text <b>bold text </b>
            <b>bold underline text</b> <!-- NOT UNDERLINED!! -->
        </p>
    </body>

Can someone point me in the right direction to get the result I want? Thanks in advance from an XQuery n00b!

2 answers

Your approach is correct, but the transformation logic is not written in the functional style that XQuery favors.

Try a recursive dispatcher like this:

    xquery version '1.0-ml';
    declare namespace mittai = "mittai";

    declare function mittai:parse-thru($n as node())
    {
        for $z in $n/node()
        return mittai:dispatch($z)
    };

    declare function mittai:dispatch($n as node())
    {
        typeswitch($n)
            case text() return $n
            case element(p) return element{ fn:node-name($n) } {mittai:parse-thru($n)}
            case element(span) return element{ fn:node-name($n) } {mittai:parse-thru($n)}
            case element(body) return element{ fn:node-name($n) } {mittai:parse-thru($n)}
            default return element{ fn:node-name($n) } {$n/@*, mittai:parse-thru($n)}
    };

    let $d := doc('myfile.html')
    return
        <html>
        {mittai:parse-thru($d)}
        </html>
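
The dispatcher above copies elements through without looking at the `style` attributes, so the spans still need to be mapped to tags. One way this could be done is sketched below, assuming standard XQuery 1.0 rather than the MarkLogic-specific `1.0-ml` dialect. The function names `local:span-tags`, `local:wrap`, `local:dispatch` and `local:parse-thru` are illustrative and not part of the answer above. Each style property is tested independently instead of in an if/else-if chain, so a span with both bold and underline gets both wrappers:

```xquery
(: Hypothetical helper: list the tags a Qt style string asks for.
   Each property is checked on its own, so several can match. :)
declare function local:span-tags($style as xs:string?) as xs:string*
{
    (: strip whitespace so 'text-decoration: underline' also matches :)
    let $s := replace(string($style), '\s+', '')
    return (
        if (contains($s, 'font-weight:600')) then 'b' else (),
        if (contains($s, 'font-style:italic')) then 'em' else (),
        if (contains($s, 'text-decoration:underline')) then 'u' else ()
    )
};

(: Hypothetical helper: nest $content inside one element per tag name. :)
declare function local:wrap($tags as xs:string*, $content as item()*) as item()*
{
    if (empty($tags)) then $content
    else element { $tags[1] } { local:wrap(subsequence($tags, 2), $content) }
};

declare function local:parse-thru($n as node()) as item()*
{
    for $z in $n/node()
    return local:dispatch($z)
};

declare function local:dispatch($n as node()) as item()*
{
    typeswitch ($n)
        case text() return $n
        (: a span is replaced by its nested formatting tags :)
        case element(span) return
            local:wrap(local:span-tags($n/@style), local:parse-thru($n))
        (: any other element is rebuilt without its attributes :)
        case element() return
            element { node-name($n) } { local:parse-thru($n) }
        (: comments and processing instructions are dropped :)
        default return ()
};

for $body in doc('myfile.html')/html/body
return local:dispatch($body)
```

With the sample input from the question, the span in the ninth paragraph should come out as `<b><u>bold underline text</u></b>`. Adjacent spans are not merged, so the last paragraph would contain two separate `<b>` elements rather than the single `<b>` shown in the desired output.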

This XQuery (using an identity function as the common pattern):

    declare variable $Prop as element()* := (
        <prop name="em">font-style:italic</prop>,
        <prop name="strong">font-weight:600</prop>,
        <prop name="u">text-decoration:underline</prop>
    );

    declare function local:copy($element as element()) {
        element {node-name($element)} {
            $element/@*,
            for $child in $element/node()
            return
                if ($child instance of element())
                then local:match($child)
                else $child
        }
    };

    declare function local:match($element as element()) {
        if ($element/self::span[@style])
        then local:replace($element)
        else local:copy($element)
    };

    declare function local:replace($element as element()) {
        let $prop := local:parse($element/@style)
        let $no-match := $prop[not(.=$Prop)]
        return
            element {node-name($element)} {
                $element/@* except $element/@style,
                if (exists($no-match))
                then attribute style {string-join($no-match,';')}
                else (),
                local:nested($Prop[.=$prop]/@name,$element)
            }
    };

    declare function local:parse($string as xs:string) {
        for $property in tokenize($string,';')[.]
        return
            <prop>{
                replace(normalize-space($property),'( )?:( )?',':')
            }</prop>
    };

    declare function local:nested($names as xs:string*, $element as element()) {
        if (exists($names))
        then element {$names[1]} {local:nested($names[position()>1],$element)}
        else
            for $child in $element/node()
            return
                if ($child instance of element())
                then local:match($child)
                else $child
    };

    local:match(*)

Output:

    <html>
        <head>
        </head>
        <body style=" font-family:'Lucida Grande'; font-size:14pt; font-weight:400; font-style:normal;">
            <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text</p>
            <p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"/>
            <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span>
                    <strong>bold text</strong>
                </span>
            </p>
            <p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-weight:600;"/>
            <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span>
                    <em>italics text</em>
                </span>
            </p>
            <p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px; font-style:italic;"/>
            <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span>
                    <u>underline text</u>
                </span>
            </p>
            <p style="-qt-paragraph-type:empty; margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;"/>
            <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span>
                    <strong>
                        <u>bold underline text</u>
                    </strong>
                </span>
            </p>
            <p style=" margin-top:0px; margin-bottom:0px; margin-left:0px; margin-right:0px; -qt-block-indent:0; text-indent:0px;">plain text <span>
                    <strong>bold text </strong>
                </span>
                <span>
                    <strong>
                        <u>bold underline text</u>
                    </strong>
                </span>
            </p>
        </body>
    </html>
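
The output above still carries the `<span>` wrappers and the Qt `style` attributes, which the desired result in the question does not have. A possible cleanup pass is sketched below, to be run with that output as its input document. `local:clean` is a hypothetical helper name, and the sketch assumes that every remaining `span` and `style` attribute can be discarded:

```xquery
(: Hypothetical cleanup pass: unwrap spans and drop style attributes. :)
declare function local:clean($n as node()) as node()*
{
    typeswitch ($n)
        (: replace a span by its (cleaned) children :)
        case element(span) return
            for $c in $n/node() return local:clean($c)
        (: rebuild any other element without its style attribute :)
        case element() return
            element { node-name($n) } {
                $n/@* except $n/@style,
                for $c in $n/node() return local:clean($c)
            }
        (: text, comments and processing instructions pass through :)
        default return $n
};

(: as in the query above, the context item is the input document :)
local:clean(*)
```

This unwraps every span, including any that kept a leftover `style` for properties not listed in `$Prop`, so it is only suitable when those properties are not needed on the web page.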

Source: https://habr.com/ru/post/1338351/

