lxml.html.tostring reorders the doctype and XML declaration when printing

Suppose I have a file test.html with this content:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd"><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Components of the SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body></body></html> 

And I run this at the Python prompt:

    >>> import lxml.html
    >>> t = lxml.html.parse('test.html')
    >>> lxml.html.etree.tostring(t)
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" "http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">\n<?xml version="1.0" encoding="UTF-8" standalone="no"??><html xmlns="http://www.w3.org/1999/xhtml"><head><title>Components of the SDK</title><link rel="stylesheet" href="core.css" type="text/css"/><meta name="generator" content="DocBook XSL Stylesheets V1.74.0"/></head><body/></html>'

Notice how the doctype and the XML declaration swap places after lxml reads the data and prints it back through tostring? How can I fix this so that lxml does not try to change the document (assuming it is well-formed)?

1 answer

Short answer

Do this instead (assuming your documents are all well-formed XML):

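    import lxml.etree

    # Parse with the XML parser, then re-emit the declaration using the values recorded in docinfo
    etx = lxml.etree.parse('test.html')
    print(lxml.etree.tostring(etx, xml_declaration=True,
                              encoding=etx.docinfo.encoding,
                              standalone=etx.docinfo.standalone))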

Explanation

test.html is not really valid HTML. It contains self-closed empty elements and an XML processing instruction, neither of which HTML understands. The HTML parser interprets the XML processing instruction as an SGML processing instruction (these look like <? ... > rather than XML's <? ... ?>) with the contents xml version="1.0" encoding="UTF-8" standalone="no"?. So when the tree is reserialized as XML, the processing instruction ends with doubled question marks: ??>
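One way to see this for yourself is to look at which nodes each parser places in front of the root element. The following is only a sketch: SAMPLE is a shortened in-memory stand-in for test.html, and leading_nodes is a hypothetical helper, not part of lxml. What the HTML parser reports depends on the libxml2 version in use, so no output is shown for it.

```python
from io import BytesIO

import lxml.html
from lxml import etree

# Shortened stand-in for test.html (an assumption, not the original file).
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="no"?>'
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    b'"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>Components of the SDK</title></head><body></body></html>'
)


def leading_nodes(root):
    """Hypothetical helper: comments and PIs that sit before the root element."""
    return list(root.itersiblings(preceding=True))


# Same bytes, two parsers.
xml_root = etree.parse(BytesIO(SAMPLE)).getroot()
html_root = lxml.html.parse(BytesIO(SAMPLE)).getroot()

# The XML parser consumes the declaration as document metadata, so nothing
# precedes the root. The HTML parser may keep it as a node in the tree.
print("XML parser: ", leading_nodes(xml_root))
print("HTML parser:", leading_nodes(html_root))
```

With the XML parser the list is empty, because the declaration is recorded on the document rather than stored as a node. Any node the HTML parser reports here is what later gets written back out after the doctype.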

Your results with the html5lib parser or serializer will be slightly better: if you reserialize to XML, the processing instruction ends up inside a comment. This is because HTML5 also disallows SGML processing instructions and treats the XML preamble as garbage text to be ignored.

To get the results you want, parse and serialize your document with the XML parser (lxml.etree) instead. The document appears to be well-formed XML and valid XHTML 1.1. If you instead serialize with the HTML serializer (lxml.html.tostring(), not lxml.html.etree.tostring()), it will output a polyglot XHTML document.

The wrinkle is that the serializer does not try to preserve the XML declaration exactly (it is, after all, not part of the XML infoset). You need to pass its values to tostring() yourself, taking them from the tree's docinfo property.
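As a minimal sketch of that round trip, the snippet below parses an in-memory document with the XML parser and feeds the docinfo values back into tostring(). The names SAMPLE and roundtrip are made up for illustration, and the sample is a shortened stand-in for test.html.

```python
from io import BytesIO

from lxml import etree

# Shortened stand-in for test.html (an assumption, not the original file).
SAMPLE = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="no"?>'
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    b'"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>Components of the SDK</title></head><body></body></html>'
)


def roundtrip(data):
    """Hypothetical helper: parse as XML and serialize with the original
    declaration settings taken from docinfo."""
    tree = etree.parse(BytesIO(data))
    info = tree.docinfo
    return etree.tostring(
        tree,
        xml_declaration=True,        # the declaration is omitted unless requested
        encoding=info.encoding,      # encoding recorded from the parsed document
        standalone=info.standalone,  # False for standalone="no", None if absent
    )


result = roundtrip(SAMPLE)
print(result.decode("utf-8"))
```

The declaration comes out first, followed by the doctype and then the root element. The serializer regenerates the declaration from these values, so its quoting and whitespace may differ from the input even though the settings match.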


Source: https://habr.com/ru/post/1386432/

