Can someone explain this bit of HtmlAgilityPack code?

Question

I tried my best to add comments throughout the code, but I'm stuck in certain areas.

// Create a new instance of the HtmlDocument class called doc.
HtmlDocument doc = new HtmlDocument();

// The Load method is called here to load the variable result, which is HTML
// formatted into a string in a previous code snippet.
doc.Load(new StringReader(result));

// A new variable called root with datatype HtmlNode is created here.
// I'm not sure what doc.DocumentNode refers to?
HtmlNode root = doc.DocumentNode;

// A list is getting constructed here. I haven't had much experience
// with constructing lists yet.
List<string> anchorTags = new List<string>();

// A foreach loop is used to loop through the HTML document to
// extract HTML with 'a' attributes, I think...
foreach (HtmlNode link in root.SelectNodes("//a"))
{
    // Don't really know what's going on here.
    string att = link.OuterHtml;

    // Don't really know what's going on here either.
    anchorTags.Add(att);
}

I took this sample code from here. Credit: Farooq Kaiser

+4

c# web-scraping html-agility-pack

super9 Dec 21 '10 at 15:20

2 answers

The key is the SelectNodes method. It uses XPath to grab a list of nodes from the HTML that match your query.

This is where I learned XPath: http://www.w3schools.com/xpath/default.asp

The code then simply loops over the matching nodes, reads each node's OuterHtml (the full HTML, including the tags), and adds it to the list of strings. A List is basically an array, but more flexible. If you only need the content and not the tags, use HtmlNode.InnerHtml or HtmlNode.InnerText instead.
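To make the difference between the three properties concrete, here is a minimal sketch. The HTML string and the class name OuterInnerDemo are made up for illustration, and it assumes the HtmlAgilityPack package is referenced. It prints OuterHtml, InnerHtml and InnerText for each anchor so they can be compared side by side.

```csharp
using System;
using HtmlAgilityPack;

class OuterInnerDemo
{
    static void Main()
    {
        // Hypothetical input, invented for this illustration.
        string html = "<p>See <a href=\"http://example.com\">the <b>example</b> site</a>.</p>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html); // LoadHtml parses directly from a string

        // By default SelectNodes returns null (not an empty collection)
        // when nothing matches, so check before looping.
        HtmlNodeCollection links = doc.DocumentNode.SelectNodes("//a");
        if (links == null)
        {
            Console.WriteLine("No anchors found.");
            return;
        }

        foreach (HtmlNode link in links)
        {
            // The whole element, including its own <a ...> and </a> tags.
            Console.WriteLine("OuterHtml: " + link.OuterHtml);
            // Only what sits between the tags, nested markup included.
            Console.WriteLine("InnerHtml: " + link.InnerHtml);
            // Only the text, with nested tags stripped.
            Console.WriteLine("InnerText: " + link.InnerText);
            // The link target, or "" if there is no href attribute.
            Console.WriteLine("href:      " + link.GetAttributeValue("href", ""));
        }
    }
}
```

The null check matters for the code in the question as well: a foreach directly over root.SelectNodes("//a") would throw if the page contained no anchors.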

+5

LoveMeSomeCode Dec 21 '10 at 15:35

Simon Mourier · Accepted Answer · 2010-12-21T18:32:24+0000

In HTML Agility Pack terms, "//a" means "find all tags named 'a' or 'A' anywhere in the document". See the XPath docs for more general XPath help (independent of HTML Agility Pack). So, if the document looks like this:

 <div> <A href="xxx">anchor 1</a> <table ...> <a href="zzz">anchor 2</A> </table> </div>

You will get two HTML anchor elements. OuterHtml represents the HTML of the node, including the node itself, while InnerHtml represents only the HTML content inside the node. So, here are the two OuterHtml values:

  <A href="xxx">anchor 1</a>

and

 <a href="zzz">anchor 2</A>

Note: I said 'a' or 'A' because the HAP implementation takes care of HTML's case insensitivity. However, "//A" does not work by default. You need to specify tag names in lowercase.
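A quick way to check this yourself is the sketch below. It reuses the mixed-case markup from this answer, with the elided table attributes simply left out, and counts the matches for both spellings of the query. The class name TagCaseDemo is made up, and the HtmlAgilityPack package is assumed to be referenced.

```csharp
using System;
using HtmlAgilityPack;

class TagCaseDemo
{
    static void Main()
    {
        // Mixed-case anchors, as in the sample above (table attributes omitted).
        string html = "<div> <A href=\"xxx\">anchor 1</a> <table> <a href=\"zzz\">anchor 2</A> </table> </div>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Same query, lowercase versus uppercase tag name.
        HtmlNodeCollection lower = doc.DocumentNode.SelectNodes("//a");
        HtmlNodeCollection upper = doc.DocumentNode.SelectNodes("//A");

        // SelectNodes returns null by default when nothing matches,
        // so treat null as zero matches.
        Console.WriteLine("//a matched: " + (lower == null ? 0 : lower.Count));
        Console.WriteLine("//A matched: " + (upper == null ? 0 : upper.Count));
    }
}
```

Going by the note above, the lowercase query should find both anchors and the uppercase one should find none, but it is worth running against the HAP version you actually use.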
