HtmlAgilityPack: Can someone explain what exactly is the result of setting the HtmlDocument OptionAutoCloseOnEnd to true?

Current documentation says:

Determines whether closing of unclosed nodes is done at the end or directly in the document. Setting this value to true can actually change the way browsers render the page. The default value is false.

Sorry, but I have to admit that I do not understand this paragraph. In particular, "at the end" of what? And what exactly does "directly in the document" mean? The last sentence sounds ominous. If the option is set to true and the HTML is well formed, will it still affect the document?

I looked in the source code, but I did not understand what was happening: the code appears to act when the property is not set to true. See HtmlNode.cs and search for OptionAutoCloseOnEnd (line 1707). I also found some funky code in HtmlWeb.cs on lines 1113 and 1154. Unfortunately the source browser does not display line numbers, but you can search the page for OptionAutoCloseOnEnd.

Could you illustrate with an example what this option does?

I use HtmlAgilityPack to fix bad html and to export page content to xml.

I came across some poorly formed HTML with overlapping tags. Here is a snippet:

<p>Blah bah <P><STRONG>Some Text</STRONG><STRONG></p> <UL> <LI></STRONG>Item 1.</LI> <LI>Item 2</LI> <LI>Item 3</LI></UL> 

Note that the first p tag is not closed and note the overlapping STRONG tag.

If I set OptionAutoCloseOnEnd to true, will this be fixed somehow? I am trying to understand what effect setting this property has on the structure of the document as a whole.

Here is the C# code I'm using:

 HtmlDocument doc = new HtmlDocument();
 doc.OptionOutputAsXml = true;
 doc.OptionFixNestedTags = true;
 // doc.OptionAutoCloseOnEnd = true;
 doc.LoadHtml(htmlText);

Thanks!

2 answers

The current code always closes unclosed nodes just before the parent node is closed. So the following code

 var doc = new HtmlDocument();
 doc.LoadHtml("<x>hello<y>world</x>");
 doc.Save(Console.Out);

will output this (the unclosed <y> is closed before its parent <x> is closed):

 <x>hello<y>world</y></x> 

Originally, the option, when set, was meant to produce this instead (and not for XML output types):

 <x>hello<y>world</x></y> 

with the closing tag for <y> placed at the end of the document (that is what "end" means). Note that in this case you can still end up with overlapping elements.

This feature (possibly useless, I admit) was broken at some point in the past; I don't know why.
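Because the behavior described above depends on the library version, one way to see what an installed version actually does is to serialize the same fragment with the option off and then on. This is only a minimal sketch. It assumes the HtmlAgilityPack NuGet package is referenced. The class name AutoCloseOnEndDemo and the helper Render are hypothetical names introduced for illustration. No expected output is shown, since it may differ between versions.

```csharp
using System;
using HtmlAgilityPack;

public static class AutoCloseOnEndDemo
{
    // Hypothetical helper: parses the fragment with the given option value
    // and returns the markup as HtmlAgilityPack serializes it.
    public static string Render(string html, bool autoCloseOnEnd)
    {
        var doc = new HtmlDocument();
        doc.OptionAutoCloseOnEnd = autoCloseOnEnd; // set before loading the markup
        doc.LoadHtml(html);
        return doc.DocumentNode.OuterHtml;
    }

    public static void Main()
    {
        // Same fragment as in the example above: <y> is never closed.
        const string html = "<x>hello<y>world</x>";

        Console.WriteLine("false: " + Render(html, false));
        Console.WriteLine("true:  " + Render(html, true));
    }
}
```

If both lines print the same markup, the option has no visible effect on this input in the version you are running.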

Note: the <p> tag is special because it is governed by a custom HtmlElementFlag by default. It is declared like this in HtmlNode.cs:

 ElementsFlags.Add("p", HtmlElementFlag.Empty | HtmlElementFlag.Closed); 

The best way to use HtmlAgilityPack would be to open and close tags programmatically wherever needed and to set:

  doc.OptionAutoCloseOnEnd = false; 

This will give you the expected formatting.

Otherwise, the library will look for any unclosed tags and close them wherever it finds convenient, depending on the flow of the parsing code.


Source: https://habr.com/ru/post/1011965/

