XPath selection in HtmlAgilityPack not working properly

I am writing a simple screen-scraping program in C#. I need to select all the input elements inside the form with the id "aspnetForm" (there are two forms on the page and I don't want the inputs from the other one). The inputs in this form sit inside various tables and divs, or directly at the first level of the form.

So, I wrote a very simple XPath query:

//form[@id='aspnetForm']//input 

It works as expected in all the browsers I tested (Chrome, IE, Firefox) - it returns what I want.

But in HtmlAgilityPack it doesn't work at all: SelectNodes always returns null.

The following queries, which I wrote as tests, work fine but do not return what I want. The first selects only the inputs that are direct children of my form, and the second returns only the form itself:

    //form[@id='aspnetForm']/input
    //form[@id='aspnetForm']

Yes, I know that I can enumerate the nodes from the last query or run another SelectNodes on its result, but I really don't want to do that. I want to use the same query as in the browsers.

Is XPath currently broken in HtmlAgilityPack? Are there any alternatives to XPath for C#?

UPDATE: test code:

    using HtmlAgilityPack;
    using Microsoft.VisualStudio.TestTools.UnitTesting;

    namespace HtmlAGPTests
    {
        [TestClass]
        public class XPathTests
        {
            private const string html =
                "<form id=\"aspnetForm\">" +
                    "<input name=\"first\" value=\"first\" />" +
                    "<div>" +
                        "<input name=\"second\" value=\"second\" />" +
                    "</div>" +
                "</form>";

            private static HtmlNode GetHtmlDocumentNode()
            {
                var document = new HtmlDocument();
                document.LoadHtml(html);
                return document.DocumentNode;
            }

            [TestMethod]
            public void TwoLevelXpathTest() // fail - nodes is NULL actually.
            {
                var query = "//form[@id='aspnetForm']//input"; // what i want
                var documentNode = GetHtmlDocumentNode();
                var inputNodes = documentNode.SelectNodes(query);
                Assert.IsTrue(inputNodes.Count == 2);
            }

            [TestMethod]
            public void TwoSingleLevelXpathsTest() // works
            {
                var formQuery = "//form[@id='aspnetForm']";
                var inputQuery = "//input";
                var documentNode = GetHtmlDocumentNode();
                var formNode = documentNode.SelectSingleNode(formQuery);
                var inputNodes = formNode.SelectNodes(inputQuery);
                Assert.IsTrue(inputNodes.Count == 2);
            }

            [TestMethod]
            public void SingleLevelXpathTest() // works
            {
                var query = "//form[@id='aspnetForm']";
                var documentNode = GetHtmlDocumentNode();
                var formNode = documentNode.SelectSingleNode(query);
                Assert.IsNotNull(formNode);
            }
        }
    }
c# xpath screen-scraping
Apr 23 '14 at 0:25
1 answer

The unexpected behavior in your test occurs because the HTML contains a <form> element, which HtmlAgilityPack parses differently from other elements. Here is a related discussion:

Ariman: "I found that after parsing, any <form> node has no child nodes. All nodes that should be inside the form (<input>, etc.) are created as its siblings rather than its children."

VikciaR: "In the HTML specification the form tag can overlap, so HtmlAgilityPack processes this node a little differently ..."

[ CodePlex talk: no child nodes for FORM objects ]
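To see this behavior on your own installation, a minimal sketch along the following lines can print the parsing flags registered for form and show where the parsed inputs end up. The class name FormFlagInspection and the helper LoadForm are hypothetical, and the output is deliberately not stated here because the default flags may differ between HtmlAgilityPack versions.

```csharp
using System;
using HtmlAgilityPack;

public static class FormFlagInspection
{
    // Same markup as the test fixture in the question.
    public const string Html =
        "<form id=\"aspnetForm\">" +
            "<input name=\"first\" value=\"first\" />" +
            "<div><input name=\"second\" value=\"second\" /></div>" +
        "</form>";

    // Hypothetical helper: parse the markup and return the form node.
    public static HtmlNode LoadForm()
    {
        var document = new HtmlDocument();
        document.LoadHtml(Html);
        return document.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
    }

    public static void Main()
    {
        // Report which special parsing flags, if any, are registered for <form>.
        HtmlElementFlag flags;
        if (HtmlNode.ElementsFlags.TryGetValue("form", out flags))
            Console.WriteLine("form flags: " + flags);
        else
            Console.WriteLine("no special flags registered for form");

        var form = LoadForm();

        // If the inputs were parsed as siblings, the form has no children
        // and the inputs show up next to it under the form's parent.
        Console.WriteLine("children of form: " + form.ChildNodes.Count);
        foreach (var node in form.ParentNode.ChildNodes)
            Console.WriteLine("node beside form: " + node.Name);
    }
}
```

If the form reports zero children and the inputs are listed beside it, that matches what Ariman describes and explains why //form[@id='aspnetForm']//input finds nothing.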

As suggested by VikciaR, try changing the initialization in the test code as follows:

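    private static HtmlNode GetHtmlDocumentNode()
    {
        // Execute this line once, before parsing: ElementsFlags is a static
        // setting, and it only affects documents loaded after the change.
        HtmlNode.ElementsFlags.Remove("form");

        var document = new HtmlDocument();
        document.LoadHtml(html);
        return document.DocumentNode;
    }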

Note: the inputQuery value in the TwoSingleLevelXpathsTest() test method should be .//input. The leading dot (.) makes the query relative to the current node. Without it, the query searches from the document root and ignores the preceding formQuery: you could change formQuery to anything and, as long as it does not return null, inputQuery would always return the same result.
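The difference between the two forms of the query is easier to see with an input that lies outside the target form. The sketch below is an illustration only: the class RelativeXPathSketch, the helper CountInputs, and the second form with id "otherForm" are hypothetical additions, not part of the original test. It assumes the form flag is removed before parsing, as described above.

```csharp
using System;
using HtmlAgilityPack;

public static class RelativeXPathSketch
{
    // The question's form plus a hypothetical second form with one more input.
    public const string Html =
        "<form id=\"aspnetForm\">" +
            "<input name=\"first\" value=\"first\" />" +
            "<div><input name=\"second\" value=\"second\" /></div>" +
        "</form>" +
        "<form id=\"otherForm\">" +
            "<input name=\"third\" value=\"third\" />" +
        "</form>";

    // Runs inputQuery with the aspnetForm node as the context node
    // and returns how many nodes it matched.
    public static int CountInputs(string inputQuery)
    {
        // Must happen before LoadHtml so the form keeps its children.
        HtmlNode.ElementsFlags.Remove("form");

        var document = new HtmlDocument();
        document.LoadHtml(Html);

        var form = document.DocumentNode.SelectSingleNode("//form[@id='aspnetForm']");
        if (form == null)
            return 0;

        // SelectNodes returns null, not an empty collection, when nothing matches.
        var nodes = form.SelectNodes(inputQuery);
        return nodes == null ? 0 : nodes.Count;
    }

    public static void Main()
    {
        // Relative query: only inputs below the context node.
        Console.WriteLine(".//input matched " + CountInputs(".//input"));
        // Absolute query: every input in the document, regardless of context.
        Console.WriteLine("//input matched " + CountInputs("//input"));
    }
}
```

The null check is part of the design: because SelectNodes returns null when nothing matches, calling Count directly on its result, as the original tests do, throws instead of failing the assertion cleanly.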

Apr 23 '14 at 4:42