I read a lot of good things about BeautifulSoup, so I'm trying to use it now to clean up a collection of poorly formed HTML websites.
Unfortunately, there is one behavior of BeautifulSoup that is pretty much a showstopper for me:
It seems that when BeautifulSoup encounters a closing tag (in my case </p>) that was never opened, it decides to end the document. In addition, the find method does not appear to search the content after the (self-induced) </html> tag in this case. This means that when the block I'm interested in comes after a stray closing tag, I cannot access its content.
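To make the problem concrete, here is a minimal sketch that locates closing tags with no matching opening tag. It uses Python's standard-library html.parser rather than BeautifulSoup, so it only illustrates the kind of input I mean and says nothing about how BeautifulSoup works internally. The names StrayCloseFinder and find_stray_closing_tags and the sample markup are made up for illustration, and the stack check is deliberately naive.

```python
from html.parser import HTMLParser

# Void elements never take a closing tag, so they are not tracked as open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class StrayCloseFinder(HTMLParser):
    """Hypothetical helper: records closing tags that were never opened."""

    def __init__(self):
        super().__init__()
        self.open_tags = []  # stack of currently open tag names
        self.stray = []      # (tag, line, column) for each unmatched closing tag

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Self-closing forms like <br/> open and close at once; nothing to track.
        pass

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Close everything up to and including the most recent match.
            while self.open_tags.pop() != tag:
                pass
        else:
            line, column = self.getpos()
            self.stray.append((tag, line, column))


def find_stray_closing_tags(html):
    """Return a list of (tag, line, column) for unmatched closing tags."""
    finder = StrayCloseFinder()
    finder.feed(html)
    finder.close()
    return finder.stray


# Made-up sample: the second </p> has no opening tag, and the block
# I actually want (the div) comes after it.
sample = (
    "<html><body>\n"
    "<p>intro</p>\n"
    "</p>\n"
    '<div id="content">wanted</div>\n'
    "</body></html>"
)
stray_tags = find_stray_closing_tags(sample)
print(stray_tags)
```

In my real pages the content I need sits where the div is in this sample, i.e. after the unmatched closing tag.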
Is there a way that I can configure BeautifulSoup to ignore inconsistent closing tags and not close the document when they occur?