Html Agility Pack: get all elements by class

I'm taking a stab at Html Agility Pack and am having trouble finding the right way to go about this.

For example:

var findclasses = _doc.DocumentNode.Descendants("div").Where(d => d.Attributes.Contains("class")); 

However, classes can obviously be added to many more elements than just divs, so I tried this:

 var allLinksWithDivAndClass = _doc.DocumentNode.SelectNodes("//*[@class=\"float\"]"); 

But that doesn't handle cases where an element has several classes and "float" is only one of them, like this:

 class="className float anotherclassName" 

Is there any way to handle all of this? I basically want to select all nodes that have a class attribute whose value contains float.

** The answer was documented on my blog with a full explanation: Html Agility Pack Get all items by class

+60
html c# html-agility-pack
Dec 07 '12 at 21:15
5 answers

(Updated 2018-03-17)

Problem:

The problem, as you noticed, is that String.Contains does not perform a word-boundary check, so Contains("float") returns true for both "foo float bar" (correct) and "unfloating" (incorrect).

The solution is to ensure that "float" (or whatever class name you are looking for) appears next to a word boundary at both ends. A word boundary is the start (or end) of a string (or line), whitespace, certain punctuation, and so on. In most regular expression flavors this is \b. So the regex you want is simply \bfloat\b.

A downside of using a Regex instance is that it can be slow to run if you don't use the RegexOptions.Compiled option, and slow to compile if you do. So you should cache the Regex instance. This is more difficult if the class name you're looking for changes at runtime.

Alternatively, you can search a string for words by word boundary without using a regex, by implementing the matching as a C# string-processing function, taking care not to cause any new string or other object allocations (for example, by not using String.Split).
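One caveat worth checking before relying on \b: a hyphen is a non-word character, so \bfloat\b also matches inside a class such as "float-left". The sketch below compares \b with a whitespace-delimited pattern built from lookarounds. The class and method names are hypothetical and not part of the original answer; it uses only System.Text.RegularExpressions.

```csharp
using System;
using System.Text.RegularExpressions;

public static class WordBoundaryDemo
{
    // \b sits between a word character and a non-word character,
    // so the '-' in "float-left" counts as a boundary.
    public static bool MatchesWithWordBoundary(string classAttr, string className) =>
        Regex.IsMatch(classAttr, @"\b" + Regex.Escape(className) + @"\b");

    // Stricter variant (an assumption, not the answer's regex):
    // no non-whitespace character may directly precede or follow the name.
    public static bool MatchesWhitespaceDelimited(string classAttr, string className) =>
        Regex.IsMatch(classAttr, @"(?<!\S)" + Regex.Escape(className) + @"(?!\S)");
}
```

Both methods accept "foo float bar" and reject "unfloating"; they differ only on hyphenated names such as "float-left", which the first accepts and the second rejects.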

Approach 1: Using a regular expression:

Suppose you just want to look for elements with a single class name that is known at design time:

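    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    class Program
    {
        // Compiled once and cached in a static field.
        private static readonly Regex _classNameRegex = new Regex( @"\bfloat\b", RegexOptions.Compiled );

        private static IEnumerable<HtmlNode> GetFloatElements(HtmlDocument doc)
        {
            // Descendants() is a method of HtmlNode, so start from doc.DocumentNode.
            return doc.DocumentNode
                .Descendants()
                .Where( n => n.NodeType == HtmlNodeType.Element )
                .Where( e => e.Name == "div" && _classNameRegex.IsMatch( e.GetAttributeValue("class", "") ) );
        }
    }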

If you need to choose a single class name at runtime, you can build the regex at that point:

    private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className)
    {
        // Regex.Escape guards against class names that contain regex metacharacters.
        Regex regex = new Regex( "\\b" + Regex.Escape( className ) + "\\b", RegexOptions.Compiled );

        return doc.DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e => e.Name == "div" && regex.IsMatch( e.GetAttributeValue("class", "") ) );
    }
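Because a compiled regex is expensive to construct, a method that builds one on every call pays that cost repeatedly. A minimal caching sketch follows, assuming the set of class names is small enough to keep in memory indefinitely. ClassRegexCache and For are hypothetical names, not part of Html Agility Pack or the original answer.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

public static class ClassRegexCache
{
    // One compiled Regex per distinct class name; entries are never evicted.
    private static readonly ConcurrentDictionary<string, Regex> _cache =
        new ConcurrentDictionary<string, Regex>(StringComparer.Ordinal);

    // Returns the cached Regex for className, building it on first use.
    public static Regex For(string className) =>
        _cache.GetOrAdd(className, name =>
            new Regex(@"\b" + Regex.Escape(name) + @"\b", RegexOptions.Compiled));
}
```

A caller would then write ClassRegexCache.For(className).IsMatch(...) in place of constructing a new Regex. If class names come from untrusted input, an unbounded cache like this one can grow without limit, so a size cap would be needed.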

If you have multiple class names and you want to match all of them, you can create an array of Regex objects and ensure that every one of them matches, or combine them into a single Regex using lookarounds. The latter leads to horrendously complicated expressions, so using a Regex[] is probably better:

    using System.Linq;

    private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String[] classNames)
    {
        // Build one regex per required class name.
        Regex[] exprs = new Regex[ classNames.Length ];
        for( Int32 i = 0; i < exprs.Length; i++ )
        {
            exprs[i] = new Regex( "\\b" + Regex.Escape( classNames[i] ) + "\\b", RegexOptions.Compiled );
        }

        // An element qualifies only if every regex matches its class attribute.
        return doc.DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e => e.Name == "div" && exprs.All( r => r.IsMatch( e.GetAttributeValue("class", "") ) ) );
    }

Approach 2: Using non-regex string matching:

The advantage of using a custom C# method for the string matching instead of a regex is hypothetically faster performance and lower memory usage (though a regex may be faster in some circumstances, so always profile your code first, kids!).

The method below, CheapClassListContains, provides a fast string-matching function that checks word boundaries and can be used the same way as regex.IsMatch:

    private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className)
    {
        return doc.DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e =>
                e.Name == "div" &&
                CheapClassListContains( e.GetAttributeValue("class", ""), className, StringComparison.Ordinal )
            );
    }

    /// <summary>Performs optionally-whitespace-padded string search without new string allocations.</summary>
    /// <remarks>A regex might also work, but constructing a new regex every time this method is called would be expensive.</remarks>
    private static Boolean CheapClassListContains(String haystack, String needle, StringComparison comparison)
    {
        if( String.Equals( haystack, needle, comparison ) ) return true;

        Int32 idx = 0;
        while( idx + needle.Length <= haystack.Length )
        {
            idx = haystack.IndexOf( needle, idx, comparison );
            if( idx == -1 ) return false;

            Int32 end = idx + needle.Length;

            // Needle must be enclosed in whitespace or be at the start/end of string
            Boolean validStart = idx == 0               || Char.IsWhiteSpace( haystack[idx - 1] );
            Boolean validEnd   = end == haystack.Length || Char.IsWhiteSpace( haystack[end] );
            if( validStart && validEnd ) return true;

            idx++;
        }
        return false;
    }

Approach 3: Using a CSS selector library:

HtmlAgilityPack is somewhat stagnant and does not support .querySelector and .querySelectorAll, but there are third-party libraries that extend HtmlAgilityPack with them: namely Fizzler and CssSelectors. Both Fizzler and CssSelectors implement QuerySelectorAll, so you can use it like this:

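    private static IEnumerable<HtmlNode> GetDivElementsWithFloatClass(HtmlDocument doc)
    {
        // Called on DocumentNode because Fizzler defines QuerySelectorAll as an extension on HtmlNode.
        return doc.DocumentNode.QuerySelectorAll( "div.float" );
    }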

With classes defined at runtime:

    private static IEnumerable<HtmlNode> GetDivElementsWithClasses(HtmlDocument doc, IEnumerable<String> classNames)
    {
        // Produces a selector such as "div.className.float" that requires every class.
        String selector = "div." + String.Join( ".", classNames );

        return doc.DocumentNode.QuerySelectorAll( selector );
    }
+85
Dec 08

You can solve your problem by using the "contains" function in your XPath query, as shown below:

 var allElementsWithClassFloat = _doc.DocumentNode.SelectNodes("//*[contains(@class,'float')]");

To reuse this in a function, do something similar to the following:

 string classToFind = "float"; var allElementsWithClassFloat = _doc.DocumentNode.SelectNodes(string.Format("//*[contains(@class,'{0}')]", classToFind)); 
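Note that contains(@class,'float') is a substring test, so it would also select an element whose class is "unfloating". A common XPath 1.0 idiom pads the normalized attribute value with spaces and searches for the space-delimited name. The sketch below exercises that expression with System.Xml so that it runs without extra packages. XPathClassDemo and its methods are hypothetical names, and the expression has not been tested here against Html Agility Pack's SelectNodes, though it uses only standard XPath 1.0 functions.

```csharp
using System;
using System.Xml;

public static class XPathClassDemo
{
    // Builds an XPath that matches className only as a whole, space-delimited token.
    // Assumes className contains no single quote.
    public static string BuildClassXPath(string className) =>
        "//*[contains(concat(' ', normalize-space(@class), ' '), ' " + className + " ')]";

    // Counts the elements in a well-formed XML fragment that carry the class.
    public static int CountMatches(string xml, string className)
    {
        var doc = new XmlDocument();
        doc.LoadXml(xml);
        return doc.SelectNodes(BuildClassXPath(className)).Count;
    }
}
```

The string returned by BuildClassXPath("float") could be passed to SelectNodes in place of the contains(@class,'float') query.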
+76
Dec 30 '12 at 2:12

I used this extension method a lot in my project. Hope this helps one of you guys.

    public static bool HasClass(this HtmlNode node, params string[] classValueArray)
    {
        var classValue = node.GetAttributeValue("class", "");
        var classValues = classValue.Split(' ');
        return classValueArray.All(c => classValues.Contains(c));
    }
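Splitting on a single space misses class lists separated by tabs or newlines, which HTML also allows. A small variation, assuming that matters for your input, splits on any whitespace instead. It is written against a plain string so that it runs without Html Agility Pack; ClassListHelper and ContainsAllClasses are hypothetical names.

```csharp
using System;
using System.Linq;

public static class ClassListHelper
{
    // Returns true when every required class appears as a whole token in classValue.
    public static bool ContainsAllClasses(string classValue, params string[] required)
    {
        // A null separator array makes Split treat any whitespace character as a delimiter.
        var tokens = (classValue ?? "").Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        return required.All(c => tokens.Contains(c, StringComparer.Ordinal));
    }
}
```

Inside an extension method like the one above, it would be called as ClassListHelper.ContainsAllClasses(node.GetAttributeValue("class", ""), classValueArray).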
+2
Jan 05 '17 at 20:18
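
    public static List<HtmlNode> GetTagsWithClass(string html, List<string> @class)
    {
        // Parse the markup first.
        var htmlDocument = new HtmlDocument();
        htmlDocument.LoadHtml(html);

        // Note: this matches only when the entire class attribute value equals one of the listed names.
        var result = htmlDocument.DocumentNode.Descendants()
            .Where(x => x.Attributes.Contains("class") && @class.Contains(x.Attributes["class"].Value))
            .ToList();
        return result;
    }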
0
Feb 26 '17 at 21:03

You can use the following script:

 var findclasses = _doc.DocumentNode.Descendants("div").Where(d => d.Attributes.Contains("class") && d.Attributes["class"].Value.Contains("float") ); 
-7
Aug 31 '14 at 17:57


