lxml: doctype missing during serialization

In [1]: from lxml import etree

I have an HTML document:

In [2]: root = etree.fromstring(u'''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>''', etree.HTMLParser())

Its doctype parses correctly:

In [3]: root.getroottree().docinfo.doctype
Out[3]: u'<!DOCTYPE html PUBLIC "-//IETF//DTD HTML//EN">'

But when it is serialized, I lose it:

In [4]: etree.tostring(root.getroottree(), method='html')
Out[4]: '<html></html>'

What should I do to get this document serialized with its doctype intact?

Debian GNU/Linux, Sid. Python 2.6.6. lxml 2.2.8-2.

2 answers

This is a bug, as mentioned in a comment on the other answer: the doctype goes missing during serialization. It was fixed in February 2015, and the fix will be released in lxml 3.5.


The only way I have been able to get it working so far is to use the default XML parser and add a non-empty system URL to the document:

>>> from StringIO import StringIO  # Python 2, as in the question
>>> html = etree.parse(StringIO('''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'''))
>>> etree.tostring(html, method="xml")
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML/>'
>>> etree.tostring(html, method="html")
'<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'

With HTMLParser, the doctype from docinfo is still dropped on serialization:

>>> html = etree.parse(StringIO('''<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN" " ">\n<HTML></HTML>'''), etree.HTMLParser())
>>> etree.tostring(html, method="html")
'<html></html>'
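
If switching parsers or upgrading is not an option, one possible workaround is to read the doctype from docinfo and prepend it to the serialized root element by hand. This is only a sketch: it assumes Python 3 and an lxml version that accepts encoding="unicode", and the helper name serialize_with_doctype is made up for illustration, not part of lxml.

```python
from lxml import etree


def serialize_with_doctype(tree, method="html"):
    """Serialize the root element and prepend the parsed doctype by hand.

    Hypothetical helper, not an lxml API.
    """
    # docinfo.doctype is an empty string when the document has no doctype.
    doctype = tree.docinfo.doctype
    # Serialize the root element rather than the tree: an element on its own
    # should not carry a doctype, so the declaration is not duplicated on
    # lxml versions that already include the fix.
    body = etree.tostring(tree.getroot(), method=method, encoding="unicode")
    return doctype + "\n" + body if doctype else body


root = etree.fromstring(
    '<!DOCTYPE HTML PUBLIC "-//IETF//DTD HTML//EN">\n<HTML></HTML>',
    etree.HTMLParser(),
)
result = serialize_with_doctype(root.getroottree())
print(result)
```

Because the doctype is taken from docinfo, it comes out in whatever form lxml reports it (in the question that is a lowercased root name), not necessarily byte-for-byte as written in the input.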

Source: https://habr.com/ru/post/1769185/

