How can I get the HTML between two surrounding HTML elements using HtmlAgilityPack?

Question

I need to get the HTML elements that sit between two other HTML elements, using HtmlAgilityPack with C#.

As an example, I have the following:

 <div id="div1" style="style definition here">
   <strong>
     <font face="Verdana" size="2">Your search request retrieved 0 matches.</font>
   </strong>
   <font face="Verdana" size="2">Some more text here.</font>
   <br><br>
   <!--more html here-->
 </div>

I want to return everything between

 <div id="div1">

and the first

 <br>

without returning either of those two elements.

I can't work out the syntax for this, so if anyone can explain the best way to get the HTML that sits between two known start tags, ignoring the end tags, I would really appreciate it.

I should also mention that I first need to find the div with id div1 within the HTML of the full web page.

The returned nodes do not need to have reference equality with the nodes that came from a particular HtmlDocument; they just have to be the same in content.

c# asp.net html-agility-pack

kseeley Sep 7 '12 at 7:08

1 answer

casperOne · Accepted Answer · 2012-09-10T17:39:52+0000

When you work with HtmlNode instances, multiple calls that select the same node return the same reference. You can take advantage of this (although it is an implementation detail, so be careful).

Basically, you get all the descendants that are elements and that come before the stopping node. First select the node to start from:

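 // "//" searches the whole document, so the div is found even when it is
 // nested deep inside a full page rather than sitting at the top level.
 HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[@id='div1']");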

Then select the node you want to stop at:

 // Note that in this case, working off the first node is not necessary, just
 // convenient for this example.
 HtmlNode brNode = divNode.SelectSingleNode("br");

Then use the TakeWhile extension method on the Enumerable class to take all the elements up to, but not including, the second node, for example:

 // The nodes.
 IEnumerable<HtmlNode> nodes = divNode.Descendants().
     TakeWhile(n => n != brNode).
     Where(n => n.NodeType == HtmlNodeType.Element);

It is the comparison in the TakeWhile method ( n => n != brNode ) that depends on reference comparison (this is the implementation-detail part).

The last filter gives you only the element nodes, as that is what you typically get from SelectSingleNode calls; if you want to process other node types, you can omit it.
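The reference comparison that TakeWhile relies on can be seen in isolation, without HtmlAgilityPack. The following is a minimal sketch: Item, TakeWhileDemo and NamesBefore are hypothetical names made up for this illustration, and Item simply stands in for a node class that does not override equality.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical stand-in for a node type; it does not override Equals or ==,
// so comparisons between instances are reference comparisons.
public sealed class Item
{
    public string Name { get; }
    public Item(string name) { Name = name; }
}

public static class TakeWhileDemo
{
    // Returns the names of all items that come before the exact 'stop' instance.
    public static List<string> NamesBefore(IEnumerable<Item> items, Item stop)
    {
        return items
            .TakeWhile(i => !ReferenceEquals(i, stop)) // stop at that instance only
            .Select(i => i.Name)
            .ToList();
    }

    public static void Main()
    {
        var stop = new Item("br");
        var items = new List<Item>
        {
            new Item("strong"),
            new Item("br"),   // same name, different instance: does not stop the sequence
            new Item("font"),
            stop,             // the sequence ends here
            new Item("div")
        };

        Console.WriteLine(string.Join(", ", NamesBefore(items, stop)));
        // prints "strong, br, font"
    }
}
```

The point of the sketch is that an item with the same content as the stopping item is not treated as the stopping point; only the identical instance is. That is why the approach depends on the library handing back the same HtmlNode object for the same node.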

Run through these nodes:

 foreach (HtmlNode node in nodes)
 {
     // Print.
     Console.WriteLine("Node: {0}", node.Name);
 }

It produces:

 Node: strong
 Node: font
 Node: font
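Since the question asks for the HTML itself rather than a flat list of element names, one possible variation is to walk the div's direct children instead of all its descendants and concatenate their markup. This is only a sketch under a few assumptions: BetweenNodes and HtmlBeforeFirstBr are hypothetical names, the HtmlAgilityPack NuGet package is referenced, the br is a direct child of the div, and the same reference-equality behaviour described in the answer holds for ChildNodes.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack; // assumes the HtmlAgilityPack NuGet package is installed

public static class BetweenNodes
{
    // Hypothetical helper: returns the markup of the div's direct children
    // that come before its first <br> child.
    public static string HtmlBeforeFirstBr(string html, string divId)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // "//" looks for the div anywhere in the page.
        HtmlNode divNode = doc.DocumentNode.SelectSingleNode("//div[@id='" + divId + "']");
        if (divNode == null)
        {
            return string.Empty; // no such div in the document
        }

        // First <br> that is a direct child of the div; null if there is none,
        // in which case TakeWhile never stops and all children are returned.
        HtmlNode brNode = divNode.SelectSingleNode("br");

        // Using ChildNodes (not Descendants) avoids emitting nested elements
        // twice, because OuterHtml already includes each child's own content.
        return string.Concat(
            divNode.ChildNodes
                   .TakeWhile(n => n != brNode) // reference comparison, as above
                   .Select(n => n.OuterHtml));
    }
}
```

Called with the sample markup from the question and "div1", this should return the strong and font elements (plus the whitespace text between them) as a single string, without the div or br tags themselves.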
