XPath / HtmlAgilityPack: how to find element (a) with a specific value for an attribute (href) and find neighboring columns of a table?

Question

I am pretty desperate because I can't figure out how to do what I described in the title. I have already read a lot of similar examples, but none of them works in my exact situation. So let's say I have the following markup:

    <table><tr>
        <td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
        <td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
        <td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
    </tr></table>

Now I already have a part of url-a and basically want to know how I can get id A and img A. I'm trying to "find" the string with XPath, but I cannot find a way to make it work. It is also possible that the information is not there at all. This is my latest attempt (seriously, I have been messing with this for more than 3 hours, trying many different ways):

    if (htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]") != null)
        string ida = htmlDoc.DocumentNode.SelectSingleNode(@"/a[contains(@href, 'part-url-a')]/following-sibling::a").InnerText;

Well, this is apparently wrong, so I would be very happy if someone could help me out here. I would also appreciate it if someone could point me to a website that explains XPath and its notation/syntax in detail, with examples like this one. Books are also welcome.

PS: I know that I could achieve my goal without XPath at all, with a regex or just a StreamReader in C#, checking whether each line contains what I need, but a) that is too fragile for my needs, because the markup can contain arbitrary line breaks, and b) I really want to stay consistent and rely on XPath for everything I do in this project.

Thanks in advance for your help!

+6

html c# visual-studio xpath html-agility-pack

Gernony Sep 03 '11 at 19:10


2 answers

You have severely broken HTML with missing closing td tags. Please fix them. This markup is just ugly to look at.

That being said, the Html Agility Pack can handle whatever crap you throw at it, so here's how to proceed, parse the garbage you have, and find the id and img values given the href:

    using System;
    using HtmlAgilityPack;

    class Program
    {
        static void Main()
        {
            var doc = new HtmlDocument();
            doc.Load("test.html");

            // Find the link whose href contains the known part of the URL.
            var anchor = doc.DocumentNode.SelectSingleNode("//a[contains(@href, 'url-a')]");
            if (anchor != null)
            {
                // The id link sits in the first <td> after the one holding the anchor.
                var id = anchor.ParentNode.SelectSingleNode("following-sibling::td/a");
                if (id != null)
                {
                    Console.WriteLine(id.InnerHtml);

                    // The img link sits in the next <td> after that.
                    var img = id.ParentNode.SelectSingleNode("following-sibling::td/a");
                    if (img != null)
                    {
                        Console.WriteLine(img.InnerHtml);
                    }
                }
            }
        }
    }

+2

Darin Dimitrov Sep 03 '11 at 19:25


Dimitre Novatchev · Accepted Answer · 2011-09-03T19:49:02+0000

Use the following XPath expression:

    /*/tr/td[a[@href='url-a']]
               /following-sibling::td[1]
                   /a/text()

When evaluated against the provided XML document (corrected to be well-formed):

    <table><tr>
        <td><a href="url-a">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>
        <td><a href="url-b">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>
        <td><a href="url-c">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>
    </tr></table>

the desired text node is selected:

 id A

Similarly, this XPath expression:

    /*/tr/td[a[@href='url-a']]
               /following-sibling::td[2]
                   /a/text()

when evaluated against the same XML document (above), selects the other desired text node:

 img A
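The two expressions above match the href exactly, while the question only has part of the URL and says the data may be missing. The attempt in the question returns nothing for two reasons: `/a` only matches an `a` that is a direct child of the document root, and `following-sibling::a` looks for sibling links, whereas here the neighbours are sibling `td` cells. A minimal sketch of how the same idea might look with `contains()` and a null check in Html Agility Pack follows. It assumes the HtmlAgilityPack NuGet package is referenced; `NeighbourCells` and `TextOfCellAfter` are hypothetical names made up for this illustration.

```csharp
using System;
using HtmlAgilityPack; // assumption: the HtmlAgilityPack NuGet package is referenced

static class NeighbourCells
{
    // Hypothetical helper: returns the link text of the cell `offset` positions
    // after the cell whose link href contains partialHref, or null if no match.
    public static string TextOfCellAfter(HtmlDocument doc, string partialHref, int offset)
    {
        // partialHref is pasted into the expression, so it must not contain a single quote.
        string xpath = "//td[a[contains(@href, '" + partialHref + "')]]"
                     + "/following-sibling::td[" + offset + "]/a";
        HtmlNode node = doc.DocumentNode.SelectSingleNode(xpath);
        return node == null ? null : node.InnerText;
    }
}

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(
            "<table><tr>" +
            "<td><a href=\"url-a\">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>" +
            "<td><a href=\"url-b\">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>" +
            "</tr></table>");

        // offset 1 is the id cell, offset 2 the img cell.
        Console.WriteLine(NeighbourCells.TextOfCellAfter(doc, "url-a", 1) ?? "(not found)");
        Console.WriteLine(NeighbourCells.TextOfCellAfter(doc, "url-a", 2) ?? "(not found)");
        // A fragment that matches no href yields null rather than an exception.
        Console.WriteLine(NeighbourCells.TextOfCellAfter(doc, "no-such-url", 1) ?? "(not found)");
    }
}
```

Selecting the `a` element and reading `InnerText` avoids depending on how `text()` nodes are returned. If the fragment matches several links, `SelectSingleNode` returns the first match in document order, so a more specific fragment is safer.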

XSLT-based verification:

When this transformation is applied to the XML document (above):

    <xsl:stylesheet version="1.0"
     xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
     <xsl:output omit-xml-declaration="yes" indent="yes"/>

     <xsl:template match="/">
      <xsl:copy-of select=
       "/*/tr/td[a[@href='url-a']]
                  /following-sibling::td[1]
                      /a/text()"/>
      <xsl:text>&#10;</xsl:text>
      <xsl:copy-of select=
       "/*/tr/td[a[@href='url-a']]
                  /following-sibling::td[2]
                      /a/text()"/>
     </xsl:template>
    </xsl:stylesheet>

the desired results are produced:

    id A
    img A
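The same check can probably be done without XSLT. The sketch below evaluates the two expressions with `System.Xml` from the .NET base class library, which only works because the corrected markup is well-formed XML; for real-world HTML a tolerant parser such as Html Agility Pack is still needed. `XPathCheck` and `Select` are hypothetical names used only for this example.

```csharp
using System;
using System.Xml;

class XPathCheck
{
    // The corrected, well-formed document from the answer above.
    const string Xml =
        "<table><tr>" +
        "<td><a href=\"url-a\">text A</a></td><td><a>id A</a></td><td><a>img A</a></td>" +
        "<td><a href=\"url-b\">text B</a></td><td><a>id B</a></td><td><a>img B</a></td>" +
        "<td><a href=\"url-c\">text C</a></td><td><a>id C</a></td><td><a>img C</a></td>" +
        "</tr></table>";

    // Returns the value of the first node selected by xpath, or null if none.
    public static string Select(string xpath)
    {
        var doc = new XmlDocument();
        doc.LoadXml(Xml);
        XmlNode node = doc.SelectSingleNode(xpath);
        return node == null ? null : node.Value;
    }

    static void Main()
    {
        // First and second cell after the cell that holds the url-a link.
        Console.WriteLine(Select("/*/tr/td[a[@href='url-a']]/following-sibling::td[1]/a/text()")); // id A
        Console.WriteLine(Select("/*/tr/td[a[@href='url-a']]/following-sibling::td[2]/a/text()")); // img A
    }
}
```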
