I'm rather surprised that lxml.html leaves insignificant whitespace in place when parsing HTML by default. I'm also surprised that I can't find an obvious way to make it strip that whitespace.
Python 2.7.3 (default, Apr 10 2013, 06:20:15)
[GCC 4.6.3] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import lxml.etree
>>> parser = lxml.etree.HTMLParser(remove_blank_text=True)
>>> html = lxml.etree.HTML("<p> Hello World </p>", parser=parser)
>>> print lxml.etree.tostring(html)
<html><body><p> Hello World </p></body></html>
I expect the result to be something like this:
>>> print lxml.etree.tostring(html)
<html><body><p>Hello World</p></body></html>
BeautifulSoup4 does the same with the html5lib parser:
>>> from bs4 import BeautifulSoup
>>> soup = BeautifulSoup("<p> Hello World </p>", "html5lib")
>>> soup.p
<p> Hello World </p>
After some research, I found that the HTML5 parsing specification does not call for collapsing consecutive whitespace; that happens at render time. So I understand that, technically, none of these libraries is responsible for doing it, but it seems useful enough that I'm surprised none of them offers it anyway.
Can someone prove me wrong?
Edit:
I know how to strip whitespace using a regular expression; that is not my question. (I also know how to search SO for regex questions.)
My question is about insignificant whitespace, where significance is determined by the standards for HTML rendering. I doubt that a one-line regular expression can correctly implement those standards. And let's not even get into the discussion of regular expressions and CFGs, please?
RegEx match open tags except XHTML self-contained tags
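To illustrate the doubt about one-liners, here is a small sketch using only the standard `re` module. The input string is a made-up example, not taken from any real page. A context-free substitution collapses whitespace inside `<pre>`, where it is significant, and still leaves the spaces at the edges of the `<p>`:

```python
import re

# Hypothetical input: a paragraph plus a <pre> block whose spacing matters.
html = "<p> Hello   World </p><pre>x  =  1\n  y = 2</pre>"

# Naive one-liner: collapse every whitespace run, regardless of context.
naive = re.sub(r"\s+", " ", html)

print(naive)
# <p> Hello World </p><pre>x = 1 y = 2</pre>
```

The line break and indentation inside `<pre>` are gone, and the leading and trailing spaces in `<p>` remain, so the result is wrong in both directions.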
Edit 2:
In case it isn't clear from context, I'm interested in HTML, not XHTML/XML. Whitespace has some nontrivial significance rules in HTML, but those rules are implemented in the renderer, not in the parser. I understand this, as my original post shows. My question is: has anyone implemented the whitespace logic of an HTML renderer in a library that works at the DOM level rather than at the rendering level?
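To make the request concrete, here is a rough sketch of the kind of DOM-level pass I have in mind. It is a simplification, not the actual CSS whitespace algorithm: the `PRESERVE` and `BLOCK` tag sets and the `collapse` and `normalize` helpers are hypothetical names invented for illustration, and it ignores CSS entirely. It uses the standard library's `xml.etree.ElementTree` so that it runs on its own; lxml elements expose the same `.text`/`.tail` interface, so the same walk should apply to an lxml tree.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, incomplete tag sets chosen for illustration only.
PRESERVE = {"pre", "textarea", "script", "style"}
BLOCK = {"html", "body", "div", "p", "ul", "ol", "li", "h1", "h2", "h3"}

_WS = re.compile(r"[ \t\n\r\f]+")


def collapse(text):
    """Collapse each run of ASCII whitespace into a single space."""
    return _WS.sub(" ", text) if text else text


def normalize(elem):
    """Approximate render-time whitespace collapsing, in place."""
    if elem.tag in PRESERVE:
        return  # whitespace is significant here; leave the subtree alone
    elem.text = collapse(elem.text)
    for child in elem:
        normalize(child)
        child.tail = collapse(child.tail)
    if elem.tag in BLOCK:
        # Assumption: whitespace at the edges of a block is not rendered.
        if elem.text:
            elem.text = elem.text.lstrip(" ")
        if len(elem):
            last = elem[-1]
            if last.tail:
                last.tail = last.tail.rstrip(" ")
        elif elem.text:
            elem.text = elem.text.rstrip(" ")


root = ET.fromstring("<p> Hello   World </p>")
normalize(root)
print(ET.tostring(root, encoding="unicode"))
# <p>Hello World</p>
```

Even this leaves gaps: whitespace between sibling block elements is collapsed to one space rather than removed, and anything controlled by stylesheets (`white-space`, `display`) is invisible to it. That is why I am hoping a library already does this properly.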