lxml removes spaces and line breaks in <head>

This small program:

from lxml.html import tostring, fromstring
e = fromstring('''
<html><head>
  <link href="/comments.css" rel="stylesheet" type="text/css">
  <link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
  <span></span>
  <span></span>
</body>
</html>''')
print (tostring(e, encoding=str)) #unicode on python 2

will print:

<html><head><link href="/comments.css" rel="stylesheet" type="text/css"><link href="/index.css" rel="stylesheet" type="text/css"></head><body>
  <span></span>
  <span></span>
</body></html>

The spaces and line breaks inside <head> are removed. The same thing happens if the two <link> elements are placed in <body> instead. It seems that the whitespace-only text nodes (\s*) between the <link> elements are dropped.
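One way to check this guess is to look at the .text and .tail attributes, which is where lxml keeps the text around elements. The following is a minimal sketch, not part of the original program: the helper name whitespace_slots and the shortened markup are made up for illustration. The XML parser is used as a baseline because it keeps whitespace-only text by default; the HTML results are printed without an expected value, since they depend on the parser.

```python
from lxml import etree
import lxml.html


def whitespace_slots(parent):
    """Return parent.text followed by the .tail of each child.

    lxml stores the text before the first child in parent.text and the
    text after each child in child.tail; None means there is no text.
    (Hypothetical helper, for illustration only.)
    """
    return [parent.text] + [child.tail for child in parent]


# Baseline: the XML parser keeps whitespace-only text nodes by default.
head = etree.fromstring('<head>\n  <link/>\n  <link/>\n</head>')
slots = whitespace_slots(head)
print(slots)

# Same inspection on an HTML-parsed tree. If the whitespace in <head>
# was dropped by the parser, the corresponding slots show up as None.
doc = lxml.html.fromstring(
    '<html><head>\n  <link href="/a.css">\n</head>'
    '<body>\n  <span></span>\n</body></html>'
)
print(whitespace_slots(doc.find("head")))
print(whitespace_slots(doc.find("body")))
```

Comparing the head and body lines of the output shows directly which whitespace survived parsing, before any serialization is involved.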

How can I keep the spaces and line breaks between the <link> elements? (I expect the output to be the same as the input.)

2 answers

In the end, I used html5lib to parse the HTML and build an lxml-style tree from it:

import html5lib
parser = html5lib.HTMLParser(tree=html5lib.getTreeBuilder("lxml"), namespaceHTMLElements=False)
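For completeness, here is a sketch of how that parser might be used end to end. It assumes html5lib and lxml are both installed; the HTML constant and the root/out variable names are only for illustration, and the markup is a shortened version of the document from the question.

```python
import html5lib
from lxml import etree

HTML = '''<html><head>
  <link href="/comments.css" rel="stylesheet" type="text/css">
  <link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
  <span></span>
</body>
</html>'''

parser = html5lib.HTMLParser(
    tree=html5lib.getTreeBuilder("lxml"),
    namespaceHTMLElements=False,
)
doc = parser.parse(HTML)

# Depending on the tree builder settings, parse() may return an
# ElementTree or the root element; normalise to the root element.
root = doc.getroot() if hasattr(doc, "getroot") else doc

# Serialise with the HTML method so empty elements are not self-closed.
out = etree.tostring(root, method="html", encoding="unicode")
print(out)
```

html5lib follows the HTML5 parsing rules, which insert whitespace found inside <head> into the tree, so the line breaks between the <link> elements should survive the round trip.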


For me,

print (tostring(e, encoding=str))

returns

>>> print (tostring(e, encoding=str))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/python2.7/dist-packages/lxml/html/__init__.py", line 1493, in tostring
    encoding=encoding)
  File "lxml.etree.pyx", line 2836, in lxml.etree.tostring (src/lxml/lxml.etree.c:53416)
TypeError: descriptor 'upper' of 'str' object needs an argument

I cannot speak to the discrepancy, but I suggest setting the pretty_print argument to True:

>>> etree.tostring(e, pretty_print=True)
'<html>\n <head>\n <link href="/comments.css" rel="stylesheet" type="text/css"/>\n <link href="/index.css" rel="stylesheet" type="text/css"/>\n </head>\n <body>\n <span/>\n <span/>\n </body>\n</html>\n'

You will need to import etree first:

from lxml import etree

When the result is written to an output file, the spaces and newlines are preserved. The same holds with print:

>>> print(etree.tostring(e, pretty_print=True))
<html>
  <head>
    <link href="/comments.css" rel="stylesheet" type="text/css"/>
    <link href="/index.css" rel="stylesheet" type="text/css"/>
  </head>
  <body>
    <span/>
    <span/>
  </body>
</html>
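Note that etree.tostring serializes as XML by default, which is why the output above contains self-closed tags such as <span/>. If HTML output is wanted, lxml.html.tostring can be used with pretty_print as well, since it defaults to the HTML method. A small sketch with a shortened document (the markup and the out variable are only for illustration):

```python
from lxml.html import fromstring, tostring

e = fromstring(
    '<html><head><link href="/index.css" rel="stylesheet"></head>'
    '<body><span></span></body></html>'
)

# lxml.html.tostring uses method="html" by default, so void elements
# keep their HTML form and empty elements get an explicit closing tag.
out = tostring(e, pretty_print=True, encoding="unicode")
print(out)
```

The exact line breaks that pretty_print produces are decided by the serializer, so this gives a readable layout rather than the original one.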

I'm sure you have checked the API documentation, but in case you haven't, it has information on tostring(). It is also safe to assume that you have seen the tutorial on the lxml website. I would like to see more good resources: I am new to lxml myself, and anything new and good to read would be welcome.

Update

You said you would fall back to sed if you could not find a good Python solution.

This should do it with sed:

sed -i '1,2d;' input.html; sed -i '1 i\<html><head>' input.html

Two sed commands are run: the first deletes the first two lines of the file, and the second inserts <html><head> as the new first line.

Update #2

I should have thought about this more. You can do it with Python:

>>> import re
>>> newString = re.sub('\n  ', '', etree.tostring(e,encoding=unicode,pretty_print=True), count=1)
>>> print(newString)
<html><head>
    <link href="/comments.css" rel="stylesheet" type="text/css"/>
    <link href="/index.css" rel="stylesheet" type="text/css"/>
  </head>
  <body>
    <span/>
    <span/>
  </body>
</html>
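An alternative to patching the serialized string is to put whitespace back into the tree before serializing, by setting .text and .tail on the elements inside <head>. This is only a sketch: indent_children is a hypothetical helper, and it imposes a uniform layout rather than recovering whatever whitespace the input originally had. It also overwrites any text already stored in those positions, so it assumes the parent holds nothing but elements.

```python
from lxml.html import fromstring, tostring


def indent_children(parent, indent="\n  ", closing="\n"):
    """Re-insert whitespace-only text between the children of `parent`.

    Hypothetical helper: overwrites parent.text and each child's .tail.
    """
    children = list(parent)
    if not children:
        return
    parent.text = indent             # whitespace before the first child
    for child in children[:-1]:
        child.tail = indent          # whitespace between siblings
    children[-1].tail = closing      # whitespace before the closing tag


e = fromstring('''<html><head>
  <link href="/comments.css" rel="stylesheet" type="text/css">
  <link href="/index.css" rel="stylesheet" type="text/css">
</head>
<body>
  <span></span>
</body>
</html>''')

indent_children(e.find("head"))
out = tostring(e, encoding="unicode")
print(out)
```

Because the whitespace now lives in the tree, it is written out as-is without pretty_print, and the HTML serializer keeps <span></span> instead of <span/>.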

Source: https://habr.com/ru/post/891296/

