BeautifulSoup: how to ignore false end tags

I read a lot of good things about BeautifulSoup, so I'm trying to use it now to clean up a collection of poorly formed HTML websites.

Unfortunately, one behavior of BeautifulSoup is pretty much a showstopper for me:

It seems that when BeautifulSoup encounters a closing tag (in my case `</p>`) that was never opened, it decides to end the document. In addition, the `find` method does not appear to search for content behind the (self-induced) `</html>` tag in this case. This means that when the block of interest comes after a spurious closing tag, I cannot access its content.

Is there a way that I can configure BeautifulSoup to ignore inconsistent closing tags and not close the document when they occur?

1 answer

BeautifulSoup does not do any parsing itself; it uses the output of the parser you select (`lxml`, `html.parser`, or `html5lib`).

Choose a different parser if the one you are using now does not handle the broken HTML the way you want. `lxml` is the fastest parser and copes well with broken HTML; `html5lib` is closest to how your browser would parse broken HTML, but it is much slower.
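As a minimal sketch of the advice above (assuming `bs4` is installed; the HTML snippet and the `target` id are made up for illustration), you can pass the parser name as the second argument to `BeautifulSoup` and check that content after a stray `</p>` stays reachable:

```python
from bs4 import BeautifulSoup

# Broken HTML: a stray </p> appears before the content we care about.
broken = "<html><body></p><div id='target'>hello</div></body></html>"

# The stdlib html.parser simply ignores the unmatched end tag,
# so the <div> after it is still part of the tree.
soup = BeautifulSoup(broken, "html.parser")
content = soup.find("div", id="target").get_text()
print(content)  # hello
```

If one parser still truncates the document, try another (`"lxml"` or `"html5lib"` in place of `"html.parser"`) and compare the resulting trees; they can differ noticeably on malformed input.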

Install the parser you want, then pass its name to the BeautifulSoup() constructor.


Source: https://habr.com/ru/post/1620871/
