Is there any way to parse invalid HTML?

Question

I need to parse invalid HTML files that contain multiple stray elements (e.g. BODY) on random lines throughout the file. I tried parsing it as XML, but with no luck, because the file is also invalid as XML (many invalid attributes on random elements throughout the file). HtmlAgilityPack was also unable to read this file: it only reads the file up to the first invalid element and nothing after it.

Here is a small example of such a file:

<HTML>
<HEAD>
<TITLE>My title</TITLE>
</HEAD>
<BODY leftmargin=9 topmargin=7 >
<TABLE>
<TR> <TD>Test</TD> </TR>
<TR> <TD>Test</TD> <TD>Test<TD> </TR>
<BODY> <-- This is the point where HtmlAgilityPack is stuck --!>
<TR> <TD>Test</TD> <TD>Test</TD> </TR>
<TR>
</BODY>
<TR> <TD><FONT>Test</FONT></TD> </TR>
</TABLE>
</BODY>

I am trying to parse the information from this table.

+6

c# xml .net

Jcf Oct 10 '11 at 12:27

3 answers

We have parsed web pages containing invalid HTML with the Html Agility Pack. As far as I remember, it did a good job.
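
As a rough illustration of this suggestion, the sketch below loads a file with the Html Agility Pack, prints whatever parse errors the library recorded, and then lists the text of every table cell it managed to recover. The file name `broken.html` is a hypothetical placeholder, and whether all rows of a particular broken file survive parsing is not guaranteed, so treat it as a starting point rather than a confirmed fix.

```csharp
using System;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.Load("broken.html"); // hypothetical path to the invalid file

        // The parser does not throw on bad markup; it records problems here.
        foreach (HtmlParseError error in doc.ParseErrors)
            Console.WriteLine($"line {error.Line}: {error.Reason}");

        // Element names are lower-cased by the parser, so "//td" matches <TD>.
        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = doc.DocumentNode.SelectNodes("//td");
        if (cells != null)
        {
            foreach (HtmlNode cell in cells)
                Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```

Comparing the number of cells printed with the number in the source file is a quick way to see whether the parser stopped early, as described in the question.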

+3

Eugeniu torica Oct 10 '11 at 12:42

You can use SgmlReader. Of course, if your HTML files are very badly broken, it will not help you.
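
A minimal sketch of the SgmlReader route, assuming the SgmlReader library (namespace `Sgml`) is referenced and the input lives in a hypothetical file named `broken.html`. SgmlReader is an `XmlReader`, so its output can be fed into the usual XML classes; the assumption that the cells end up reachable as `//td` without a namespace should be checked against the actual output.

```csharp
using System;
using System.IO;
using System.Xml;
using Sgml;

class Program
{
    static void Main()
    {
        using (TextReader input = File.OpenText("broken.html")) // hypothetical path
        {
            var sgml = new SgmlReader();
            sgml.DocType = "HTML";                  // use the built-in HTML DTD
            sgml.CaseFolding = CaseFolding.ToLower; // <TD> becomes <td>
            sgml.InputStream = input;

            // SgmlReader repairs the markup on the fly while XmlDocument reads it.
            var xml = new XmlDocument();
            xml.Load(sgml);

            foreach (XmlNode cell in xml.SelectNodes("//td"))
                Console.WriteLine(cell.InnerText.Trim());
        }
    }
}
```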

0

Łukasz Wiatrak Oct 10 '11 at 12:33

Matěj Zábský · Accepted Answer · 2011-10-10T12:35:33+0000

Let Internet Explorer do the hard work for you - it will do its best to "repair" the broken markup into something it understands (a technically sound structure with properly paired tags, etc.).

Open the HTML in a System.Windows.Forms.WebBrowser (or System.Windows.Controls.WebBrowser if you prefer the WPF libraries); you can then walk the DOM through the Document property. The DOM will always be well-formed, regardless of how broken the original source was.

No third-party libraries required.
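
One way this could look with the Windows Forms control is sketched below. It assumes a console project that references System.Windows.Forms and a hypothetical input file `broken.html`; the control needs an STA thread and a running message loop, which is why the code calls `Application.Run()` and exits the loop from the `DocumentCompleted` handler.

```csharp
using System;
using System.IO;
using System.Windows.Forms;

static class Program
{
    [STAThread] // the WebBrowser control requires a single-threaded apartment
    static void Main()
    {
        string html = File.ReadAllText("broken.html"); // hypothetical path

        using (var browser = new WebBrowser { ScriptErrorsSuppressed = true })
        {
            browser.DocumentCompleted += (sender, e) =>
            {
                // By now the browser engine has rebuilt the markup into a DOM tree.
                foreach (HtmlElement cell in browser.Document.GetElementsByTagName("td"))
                    Console.WriteLine(cell.InnerText);

                Application.ExitThread(); // stop the message loop started below
            };

            browser.DocumentText = html; // starts loading asynchronously
            Application.Run();           // pump messages until ExitThread is called
        }
    }
}
```

For pages with frames, `DocumentCompleted` can fire more than once, so a real program may need to check which document finished before reading the DOM.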
