Is there any way to parse invalid HTML?

I need to parse invalid HTML files that contain stray elements (e.g. BODY) at random places throughout the file. I tried parsing one as XML, but had no luck, because the file is not well-formed XML either (there are many invalid attributes on random elements throughout the file). HtmlAgilityPack was also unable to read it: it only reads the file up to the first incorrect element and nothing after it.

Here is a small example of such a file:

<HTML>
<HEAD>
<TITLE>My title</TITLE>
</HEAD>
<BODY leftmargin=9 topmargin=7 >
<TABLE>
<TR> <TD>Test</TD> </TR>
<TR> <TD>Test</TD> <TD>Test<TD> </TR>
<BODY> <-- This is the point where HtmlAgilityPack is stuck --!>
<TR> <TD>Test</TD> <TD>Test</TD> </TR>
<TR>
</BODY>
<TR> <TD><FONT>Test</FONT></TD> </TR>
</TABLE>
</BODY>

I am trying to parse the information from this table.

3 answers

Let Internet Explorer do the hard work for you: it will do everything it can to “repair” the broken markup into something it understands (a technically sound tree with properly matched tags, and so on).

Load the HTML into a Windows.Forms.WebBrowser (or a Windows.Controls.WebBrowser if you prefer the WPF libraries) and you can walk the DOM through its Document property. The DOM will always be well-formed, no matter how broken the original source was.

No third-party libraries required.

We have parsed web pages with invalid HTML using the Html Agility Pack. As far as I remember, it does a good job.

You can use SgmlReader. Of course, if your HTML files are very badly malformed, this will not help you.
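
If a tree-building parser gives up on a file like the one in the question, one fallback is to skip the tree entirely and scan the tag stream, collecting cell text as it goes. The following is only a minimal sketch of that idea, written with Python's standard `html.parser` rather than any of the .NET libraries mentioned in this thread. The names `CellScanner`, `extract_rows` and `SAMPLE` are made up for illustration, and dropping empty cells is an assumed heuristic for the unclosed `<TD>` in the sample.

```python
from html.parser import HTMLParser

# The sample from the question, stray <BODY> tags and all.
SAMPLE = (
    "<HTML> <HEAD> <TITLE>My title</TITLE> </HEAD> "
    "<BODY leftmargin=9 topmargin=7 > <TABLE> <TR> <TD>Test</TD> </TR> "
    "<TR> <TD>Test</TD> <TD>Test<TD> </TR> <BODY> "
    "<-- This is the point where HtmlAgilityPack is stuck --!> "
    "<TR> <TD>Test</TD> <TD>Test</TD> </TR> <TR> </BODY> "
    "<TR> <TD><FONT>Test</FONT></TD> </TR> </TABLE> </BODY>"
)


class CellScanner(HTMLParser):
    """Collect <td>/<th> text row by row, ignoring how tags are nested."""

    def __init__(self):
        super().__init__()
        self.rows = []     # finished rows
        self._row = []     # cells of the row currently being read
        self._cell = None  # text pieces of the open cell, or None

    def _end_cell(self):
        if self._cell is not None:
            text = "".join(self._cell).strip()
            if text:  # assumed heuristic: drop empty cells (unclosed <TD>)
                self._row.append(text)
            self._cell = None

    def _end_row(self):
        self._end_cell()
        if self._row:
            self.rows.append(self._row)
        self._row = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names in lower case.
        if tag == "tr":
            self._end_row()
        elif tag in ("td", "th"):
            self._end_cell()
            self._cell = []
        # Everything else (stray <body>, <font>, ...) is ignored.

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._end_cell()
        elif tag in ("tr", "table"):
            self._end_row()

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)


def extract_rows(html):
    scanner = CellScanner()
    scanner.feed(html)
    scanner.close()
    scanner._end_row()  # flush a row left open at end of input
    return scanner.rows


if __name__ == "__main__":
    for row in extract_rows(SAMPLE):
        print(row)
```

Because the scanner reacts only to TR, TD and TH tags, the stray BODY elements and the malformed comment never affect the result. The trade-off is that it knows nothing about nested tables or attributes, so it only suits flat tables like the one in the question.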

Source: https://habr.com/ru/post/898966/

