BeautifulSoup: how to ignore false end tags

I read a lot of good things about BeautifulSoup, so I'm trying to use it now to clean up a collection of poorly formed HTML websites.

Unfortunately, one behavior of BeautifulSoup is pretty much a showstopper for me:

It seems that when BeautifulSoup encounters a closing tag (in my case `</p>`) that was never opened, it decides to end the document. In addition, the `find` method does not appear to search for content behind the (self-induced) `</html>` tag in this case. This means that when the block of interest comes after a spurious closing tag, I cannot access its content.

Is there a way that I can configure BeautifulSoup to ignore inconsistent closing tags and not close the document when they occur?

1 answer

BeautifulSoup does not do any parsing itself; it uses the output of the parser you select (`lxml`, `html.parser`, or `html5lib`).

Choose a different parser if the one you are using now does not handle the broken HTML the way you want. `lxml` is the fastest parser and copes well with broken HTML; `html5lib` is closest to how your browser would parse broken HTML, but it is much slower.
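As a minimal sketch of the advice above (assuming `bs4` is installed; the HTML snippet and the `target` id are made up for illustration), you can pass the parser name as the second argument to `BeautifulSoup` and check that content after a stray `</p>` stays reachable:

```python
from bs4 import BeautifulSoup

# Broken HTML: a stray </p> appears before the content we care about.
broken = "<html><body></p><div id='target'>hello</div></body></html>"

# The stdlib html.parser simply ignores the unmatched end tag,
# so the <div> after it is still part of the tree.
soup = BeautifulSoup(broken, "html.parser")
content = soup.find("div", id="target").get_text()
print(content)  # hello
```

If one parser still truncates the document, try another (`"lxml"` or `"html5lib"` in place of `"html.parser"`) and compare the resulting trees; they can differ noticeably on malformed input.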

Install the parser you want, then pass its name to the BeautifulSoup() constructor.


Source: https://habr.com/ru/post/1620871/
