How to remove insignificant whitespace in lxml.html?

Question

I'm pretty surprised that lxml.html keeps insignificant whitespace by default when parsing HTML. I'm also surprised that I can't find an obvious way to make it stop doing that.

    Python 2.7.3 (default, Apr 10 2013, 06:20:15)
    [GCC 4.6.3] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import lxml.etree
    >>> parser = lxml.etree.HTMLParser(remove_blank_text=True)
    >>> html = lxml.etree.HTML("<p> Hello World </p>", parser=parser)
    >>> print lxml.etree.tostring(html)
    <html><body><p> Hello World </p></body></html>

I expect the result to be something like this:

    >>> print lxml.etree.tostring(html)
    <html><body><p>Hello World</p></body></html>

BeautifulSoup4 does the same with the html5lib parser:

    >>> from bs4 import BeautifulSoup
    >>> soup = BeautifulSoup("<p> Hello World </p>", "html5lib")
    >>> soup.p
    <p> Hello World </p>

After some research, I found that the HTML5 parsing specification does not call for collapsing consecutive whitespace; that happens at rendering time. So I understand that, technically, it is not the responsibility of either library to do it, but it seems useful enough that I'm surprised neither of them offers it anyway.

Can someone prove me wrong?

Edit:

I know how to strip whitespace using a regular expression - that is not my question. (I also know how to search SO for regex questions.)

My question is about insignificant whitespace, where significance is defined by the standards for rendering HTML. I doubt that a one-line regular expression can correctly implement those standards. And let's not even get into the discussion of regular expressions vs. CFGs, please?

See: "RegEx match open tags except XHTML self-contained tags"

Edit 2:

If it is not clear from the context, I'm interested in HTML, not XHTML / XML. Whitespace has some non-trivial significance rules in HTML, but those rules are implemented in the renderer, not in the parser. I understand this, as my initial post shows. My question is: has anyone implemented the whitespace logic of an HTML renderer in a library that works at the DOM level rather than at the rendering level?

+4

python html-parsing lxml.html

Mark E. Haase Aug 29 '13 at 4:40

2 answers

OK. You want to find the whitespace and drop the unnecessary parts.

You can do this with a regular expression.

    from re import sub
    sub(r"(\s)+", ' ', yourstring)

This replaces every run of adjacent whitespace characters with exactly one space.

 '<p> Hello World </p>'

was my result with this.

I believe that is close enough to what you expect, and a single space is always better for readability than none.

With a slightly longer regexp, you can avoid spaces adjacent to HTML tags.
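The "slightly longer regexp" is not given in the answer. A minimal sketch of what it might look like follows (Python 3; the function name `squeeze_html` is made up for illustration). It also shows why the asker is wary of this route: it removes whitespace between inline elements, which a browser would render as a space, and it knows nothing about `<pre>`.

```python
import re

def squeeze_html(markup):
    """Collapse whitespace runs, then drop spaces that touch a tag boundary.

    Hypothetical helper, not part of lxml or any library. It treats the
    markup as a flat string, so it ignores <pre>, <textarea>, attribute
    values and inline-element spacing.
    """
    # Step 1: every run of whitespace becomes a single space.
    collapsed = re.sub(r"\s+", " ", markup)
    # Step 2: remove a space directly after '>' or directly before '<'.
    return re.sub(r"(?<=>) | (?=<)", "", collapsed)

print(squeeze_html("<p>   Hello    World   </p>"))
# The same rule also glues inline content together:
print(squeeze_html("a <b>b</b> c"))
```

The first call yields the output the question asks for, but the second one turns "a b c" into "abc" once rendered, so this is only safe on markup where no whitespace next to a tag matters.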

-2

Wjars Aug 29 '13 at 5:37

Ivan Chaer · Accepted Answer · 2016-03-17T11:04:41+0000

I came across the htmlmin library.

Can be installed using pip:

 pip install htmlmin

It is used as:

    from htmlmin import minify
    html = u"<html><body><p> Hello World </p></body></html>"
    minified_html = minify(html)
    print minified_html

Which returns:

 <html><body><p> Hello World </p></body></html>

I thought this would do what you were looking for, but as you can see, some insignificant whitespace is still kept.
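For comparison, here is a rough sketch of what collapsing whitespace at the tree level could look like (Python 3). It is not what htmlmin or lxml do internally. It assumes that only `pre`, `textarea`, `script` and `style` need their whitespace kept verbatim, and it only collapses runs; it does not strip leading or trailing spaces, because that needs block vs. inline knowledge from the rendering rules. The demo uses the standard library's `xml.etree.ElementTree` on a well-formed fragment so that it runs without extra packages. The function only relies on `.tag`, `.text`, `.tail` and iteration, which lxml elements also provide.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: elements whose whitespace must be preserved verbatim.
PRESERVE = {"pre", "textarea", "script", "style"}
# Runs of the ASCII whitespace characters used in HTML.
WS_RUN = re.compile(r"[ \t\n\r\f]+")

def collapse(element, preserve=False):
    """Collapse whitespace runs in .text and .tail across a tree, in place."""
    inside_preserved = preserve or element.tag in PRESERVE
    if element.text and not inside_preserved:
        element.text = WS_RUN.sub(" ", element.text)
    for child in element:
        collapse(child, inside_preserved)
        # A child's tail is text content of *this* element, so this
        # element's preserve flag decides whether to touch it.
        if child.tail and not inside_preserved:
            child.tail = WS_RUN.sub(" ", child.tail)
    return element

root = ET.fromstring(
    "<div><p>  Hello   World  </p><pre>  keep   me  </pre></div>"
)
print(ET.tostring(collapse(root), encoding="unicode"))
```

In this sketch the `<p>` text ends up with single spaces while the `<pre>` content is left untouched. Getting all the way to `<p>Hello World</p>` would mean also encoding which elements are block-level, which is exactly the renderer logic the question is asking about.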
