How can I get all the content in the <td> tag using the Agility Pack?

Question

So, I am writing an application that will do a little screen scraping. I use the HTML Agility Pack to load an entire HTML page into an HtmlDocument instance called doc. Now I want to parse this document, looking for this:

<table border="0" cellspacing="3">
<tr><td>First rows stuff</td></tr>
<tr>
<td> 
The data I want is in here <br /> 
and it seperated by these annoying <br /> 's.

No id's, classes, or even a single <p> tag. </p> Just a bunch of <br />  tags.
</td> 
</tr> 
</table>

So I just need to get the data in the second row. How can I do this? Should I use regex or something else?

Update: This is how I load doc:

HtmlWeb hw = new HtmlWeb();
HtmlDocument doc = hw.Load(Url);

+3

c# screen-scraping html-agility-pack

Bob dylan Jun 12 '10 at 5:26

5 answers

Something else.

+1

FelipeAls Jun 12 '10 at 5:33

, xml.

0

Josh Sterling Jun 12 '10 at 5:30

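"Something else": regular expressions are a poor fit for parsing HTML, so use an HTML parser instead. For C#, the HTML Agility Pack is a good choice.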

0

Alex Martelli Jun 12 '10 at 5:31

If you are already using the Agility Pack, then it's just a matter of using something like doc.DocumentNode.SelectNodes("//table[@cellspacing='3']") to get the table in the document. Take a look at the documentation and the sample code. Since the data is already parsed into a structure, it would be silly to go back to raw text and parse it all over again.
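A minimal sketch of that idea, assuming the page looks like the snippet in the question. The class name TableDemo and the helper RowTexts are made up for illustration and are not part of the Agility Pack. Note that, as far as I know, SelectNodes returns null by default (not an empty collection) when nothing matches, so the sketch checks for that.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack;

public static class TableDemo
{
    // Hypothetical helper: returns the trimmed text of every row in the
    // first table that has cellspacing="3", or an empty list if none exists.
    public static List<string> RowTexts(HtmlDocument doc)
    {
        var result = new List<string>();

        HtmlNodeCollection tables =
            doc.DocumentNode.SelectNodes("//table[@cellspacing='3']");
        if (tables == null) return result; // no matching table

        HtmlNodeCollection rows = tables[0].SelectNodes("./tr");
        if (rows == null) return result;   // table has no rows

        foreach (HtmlNode row in rows)
            result.Add(row.InnerText.Trim());

        return result;
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        // Inline sample standing in for the downloaded page.
        doc.LoadHtml(@"<table border=""0"" cellspacing=""3"">
<tr><td>First rows stuff</td></tr>
<tr><td>The data I want</td></tr>
</table>");

        foreach (string text in RowTexts(doc))
            Console.WriteLine(text);
    }
}
```

With a document loaded through HtmlWeb, as in the question, you would pass that doc to RowTexts instead of the inline sample.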

0

Eclipse Jun 12 '10 at 5:43

Mark Byers · Accepted Answer · 2010-06-12T05:43:06+0000

Since you are already using the Html Agility Pack, you can select the cell with XPath. Something like this:

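using System.Linq;      // for Single()
using HtmlAgilityPack;

HtmlDocument doc = new HtmlDocument();
doc.Load("input.html");
// Select the one <td> in the second row of the table.
HtmlNode node = doc.DocumentNode
                   .SelectNodes("//table[@cellspacing='3']/tr[2]/td")
                   .Single();
string text = node.InnerText;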
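InnerText returns the whole cell as one string. If you want the pieces between the <br /> tags separately, one option is to collect the text nodes under the cell. This is only a sketch: the class CellDemo and the helper TextFragments are hypothetical names, the sample HTML is a simplified version of the table in the question, and text inside nested tags (such as the <p> in the question) will show up as its own fragment.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using HtmlAgilityPack;

public static class CellDemo
{
    // Hypothetical helper: returns the non-empty text fragments inside
    // a node, in document order, with HTML entities decoded.
    public static List<string> TextFragments(HtmlNode cell)
    {
        // Assumption: SelectNodes yields null when nothing matches.
        HtmlNodeCollection textNodes = cell.SelectNodes(".//text()");
        if (textNodes == null) return new List<string>();

        return textNodes
            .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim())
            .Where(s => s.Length > 0)
            .ToList();
    }

    public static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(@"<table border=""0"" cellspacing=""3"">
<tr><td>First rows stuff</td></tr>
<tr><td>line one <br /> line two <br /> line three</td></tr>
</table>");

        // SelectSingleNode returns null instead of throwing when there is no match.
        HtmlNode cell = doc.DocumentNode
            .SelectSingleNode("//table[@cellspacing='3']/tr[2]/td");
        if (cell == null)
        {
            Console.WriteLine("cell not found");
            return;
        }

        foreach (string fragment in TextFragments(cell))
            Console.WriteLine(fragment);
    }
}
```

SelectSingleNode is used here instead of SelectNodes(...).Single() so that a missing cell can be handled with a null check, rather than calling Single() on a null collection.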
