Lenient HTML parsing in C++?

I am looking for a way to parse potentially malformed HTML in C++, similar to what Beautiful Soup does in Python.

An XML parser usually just works, but the HTML in this case is not valid XML/XHTML, so it cannot be parsed correctly.

Are there any libraries or tools for this?

+4
3 answers

You can use HTML Tidy to convert the HTML to valid XML (XHTML) and then use any C++ XML parser.

+6

According to its documentation, libxml2 is able to parse HTML 4.

+2

I have used Xerces and recommend it for C++. It offers both a DOM and a SAX model.

-1

Source: https://habr.com/ru/post/1336823/