Difference between lxml and html5lib in a BeautifulSoup context

Is there any difference between the capabilities of the lxml and html5lib parsers in the context of BeautifulSoup? I am trying to learn how to use BS4, and I am using the following code:

    import requests
    from bs4 import BeautifulSoup

    ret = requests.get('http://www.olivegarden.com')
    soup = BeautifulSoup(ret.text, 'html5lib')
    for item in soup.find_all('a'):
        print item['href']

I started out using lxml as the parser, but noticed that for some sites the for loop is never entered, even though the page has valid links. The same page works with the html5lib parser. Are there specific kinds of pages that may not work with lxml?
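
For anyone checking whether the parser itself is the culprit, one quick test is to feed the same downloaded page to both parsers and compare how many links each one finds. This is just a minimal sketch, using the URL from the question; it assumes both lxml and html5lib are installed:

    # Parse the same page with both parsers and compare the number of
    # <a> tags each one finds.
    import requests
    from bs4 import BeautifulSoup

    html = requests.get('http://www.olivegarden.com').text
    for parser in ('lxml', 'html5lib'):
        soup = BeautifulSoup(html, parser)
        print parser, len(soup.find_all('a'))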

I'm on Ubuntu using python-lxml 2.3.2-1 with libxml2 2.7.8.dfsg-5.1ubunt and html5lib-1.0b3

EDIT: Updated to lxml 3.1.2 and still see the same problem. On a Mac running lxml 3.0.x, however, the same page is handled correctly. The website in question is www.olivegarden.com.

1 answer

html5lib uses the HTML parsing algorithm as defined in the HTML specification and as implemented in all major browsers. lxml uses libxml2's HTML parser, which is based on its XML parser and does not follow the error-handling rules for invalid HTML that browsers use.
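
As a small illustration of that difference (not part of the original answer), you can feed the same piece of invalid markup to both parsers and compare the trees they build. The exact recovery behaviour depends on the installed library versions, but the trees will generally differ:

    # Parse the same invalid fragment with both parsers and print the
    # resulting trees. Each parser applies its own error recovery, so
    # the output usually differs (assumes lxml and html5lib are installed).
    from bs4 import BeautifulSoup

    broken = "<a><p>one<p>two"  # unclosed tags, misnested content
    for parser in ('lxml', 'html5lib'):
        print parser
        print BeautifulSoup(broken, parser).prettify()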

Most web developers only test against web browsers, standards be damned, so if you want to get what the page author intended, you will most likely need to use something like html5lib that matches the behaviour of current browsers.
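
In practice that means naming the parser explicitly rather than relying on BeautifulSoup's default choice. A hedged sketch of one way to do this, assuming html5lib may not always be installed (BeautifulSoup raises FeatureNotFound when the requested parser is unavailable):

    from bs4 import BeautifulSoup, FeatureNotFound

    def make_soup(markup):
        # Prefer html5lib for browser-like error recovery; fall back to
        # lxml if html5lib is not installed.
        try:
            return BeautifulSoup(markup, 'html5lib')
        except FeatureNotFound:
            return BeautifulSoup(markup, 'lxml')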


Source: https://habr.com/ru/post/1500221/

