Short answer
Do this instead (assuming your documents are all well-formed XML)
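import lxml.etree

# Parse with the XML parser, not the HTML parser
etx = lxml.etree.parse('test.html')
# Re-emit the XML declaration using the values recorded in docinfo
xml_bytes = lxml.etree.tostring(etx, xml_declaration=True, encoding=etx.docinfo.encoding, standalone=etx.docinfo.standalone)
print(xml_bytes.decode(etx.docinfo.encoding))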
Description
test.html is not really valid HTML. It contains self-closing empty elements and an XML processing instruction, neither of which HTML understands. The HTML parser interprets the XML processing instruction as an SGML processing instruction (these look like <? ... > rather than XML's <? ... ?>) with the contents xml version="1.0" encoding="UTF-8" standalone="no"? (note the trailing question mark). Thus, when the document is reserialized as XML, the processing instruction ends with a doubled question mark: ??>
Your results with the html5lib parser or serializer are slightly better: when you reserialize as XML, the processing instruction ends up inside a comment. This is because HTML5 also does not allow SGML processing instructions, and interprets the XML preamble as garbage text to ignore.
To get the results you want, parse and serialize your document using the XML parser (lxml.etree) instead. The document appears to be well-formed XML and valid XHTML 1.1. If you instead serialize with the HTML serializer (lxml.html.tostring(), not lxml.html.etree.tostring()), it will output a polyglot XHTML document.
The wrinkle is that the serializer does not try to preserve the XML declaration faithfully (it is, after all, not part of the XML infoset). You need to pass its values to the tostring() method yourself, taking them from the docinfo property.
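As a rough illustration of that wrinkle, the sketch below parses a small in-memory document and compares the default serialization with one that passes the docinfo values explicitly. The sample markup and the names source, tree, default_out and explicit_out are made up for this example and are not taken from the question's test.html.

```python
import io
import lxml.etree

# Hypothetical stand-in for test.html: well-formed XHTML with an XML declaration
source = (
    b'<?xml version="1.0" encoding="UTF-8" standalone="no"?>\n'
    b'<html xmlns="http://www.w3.org/1999/xhtml">'
    b'<head><title>t</title></head><body><br/></body></html>'
)

tree = lxml.etree.parse(io.BytesIO(source))
info = tree.docinfo  # records the version, encoding and standalone flag seen by the parser

# Default serialization: the XML declaration is not written back
default_out = lxml.etree.tostring(tree)

# Passing the docinfo values explicitly restores the declaration
explicit_out = lxml.etree.tostring(
    tree,
    xml_declaration=True,
    encoding=info.encoding,
    standalone=info.standalone,
)

print(default_out.decode("ascii"))
print(explicit_out.decode(info.encoding))
```

The first print should show the document without a declaration, the second with one. Quoting style and whitespace in the regenerated declaration may differ from the original file, since it is rebuilt from the stored values rather than copied.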