(Updated 2018-03-17)
Problem:
The problem, as you've observed, is that String.Contains does not perform a word-boundary check, so Contains("float") returns true for both "foo float bar" (correct) and "unfloating" (which is wrong).
The solution is to ensure that "float" (or whatever class name you are looking for) appears next to a word boundary at both ends. A word boundary is the start (or end) of a string (or line), whitespace, certain punctuation, and so on. In most regular expression flavors this is written \b . So the regular expression you want is simply \bfloat\b .
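To make the difference concrete, here is a minimal sketch comparing a plain substring check with the \b pattern. The class and method names are made up for illustration. It also shows one caveat worth knowing: \b counts a hyphen as a boundary, so \bfloat\b still matches inside "float-left". The whitespace-delimited pattern in the sketch is one possible way around that, offered as an assumption rather than as part of the original approach.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class ClassNameMatching
{
    // \b matches between a word character and a non-word character (or string edge).
    private static readonly Regex WordBoundary = new Regex(@"\bfloat\b");

    // Stricter variant: "float" must be preceded and followed by whitespace or a string edge.
    private static readonly Regex WhitespaceDelimited = new Regex(@"(?<!\S)float(?!\S)");

    public static bool MatchesSubstring(string classAttr) => classAttr.Contains("float");

    public static bool MatchesWordBoundary(string classAttr) => WordBoundary.IsMatch(classAttr);

    public static bool MatchesWhitespaceDelimited(string classAttr) => WhitespaceDelimited.IsMatch(classAttr);
}
```

For "foo float bar" all three checks return true. For "unfloating" only the substring check returns true. For "float-left" the substring and \b checks return true, while the whitespace-delimited check returns false.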
The downside of using a Regex instance is that it can run slowly unless you pass RegexOptions.Compiled, and a compiled regex is itself slow to construct. So you should cache the Regex instance. This is harder if the class name you are looking for changes at runtime.
Alternatively, you can search a string for a word delimited by word boundaries without a regular expression, by implementing the matching as a C# string-processing function, taking care not to cause any new string or other object allocations (for example, by not using String.Split ).
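One way to handle class names that change at runtime is to cache one compiled Regex per class name. The following is a rough sketch under that assumption. ClassNameRegexCache and Get are hypothetical names, and the cache is unbounded, so it only makes sense when the set of class names is small.

```csharp
using System;
using System.Collections.Concurrent;
using System.Text.RegularExpressions;

// Hypothetical cache, for illustration only.
public static class ClassNameRegexCache
{
    // Keyed by the exact class name; never evicted.
    private static readonly ConcurrentDictionary<string, Regex> Cache =
        new ConcurrentDictionary<string, Regex>(StringComparer.Ordinal);

    public static Regex Get(string className)
    {
        // The factory can run more than once under contention,
        // but only one Regex per key is stored and returned.
        return Cache.GetOrAdd(
            className,
            name => new Regex(@"\b" + Regex.Escape(name) + @"\b", RegexOptions.Compiled));
    }
}
```

A caller would then write ClassNameRegexCache.Get(className).IsMatch(classAttribute) and pay the compilation cost only the first time a given class name is seen.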
Approach 1: Using regex:
Suppose you just want to find elements with a single class name that is known at design time:
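    using System.Collections.Generic;
    using System.Linq;
    using System.Text.RegularExpressions;
    using HtmlAgilityPack;

    class Program
    {
        // Compiled once and reused for every call.
        private static readonly Regex _classNameRegex = new Regex( @"\bfloat\b", RegexOptions.Compiled );

        private static IEnumerable<HtmlNode> GetFloatElements(HtmlDocument doc)
        {
            // Descendants() is defined on HtmlNode, so start from the document's root node.
            return doc
                .DocumentNode
                .Descendants()
                .Where( n => n.NodeType == HtmlNodeType.Element )
                .Where( e => e.Name == "div" && _classNameRegex.IsMatch( e.GetAttributeValue("class", "") ) );
        }
    }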
If you need to choose a single class name at runtime, you can build the regular expression from it:
    private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className)
    {
        // Escape the class name so any regex metacharacters in it are matched literally.
        Regex regex = new Regex( "\\b" + Regex.Escape( className ) + "\\b", RegexOptions.Compiled );

        return doc
            .DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e => e.Name == "div" && regex.IsMatch( e.GetAttributeValue("class", "") ) );
    }
If you have several class names and you want to match all of them, you can create an array of Regex objects and check that every one of them matches, or combine them into a single Regex using lookarounds. The latter produces horrendously complicated expressions, so using a Regex[] is probably better:
    using System.Linq;

    private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String[] classNames)
    {
        // Build one regex per class name.
        Regex[] exprs = new Regex[ classNames.Length ];
        for( Int32 i = 0; i < exprs.Length; i++ )
        {
            exprs[i] = new Regex( "\\b" + Regex.Escape( classNames[i] ) + "\\b", RegexOptions.Compiled );
        }

        // An element qualifies only if every regex matches its class attribute.
        return doc
            .DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e => e.Name == "div" && exprs.All( r => r.IsMatch( e.GetAttributeValue("class", "") ) ) );
    }
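For comparison, the single-Regex alternative mentioned above can be sketched with one lookahead per class name. This is only an illustration of the idea, with AllClassesRegex and Build as made-up names. It also shows why the generated pattern becomes hard to read as the list of class names grows.

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Text.RegularExpressions;

// Hypothetical helper, for illustration only.
public static class AllClassesRegex
{
    public static Regex Build(IEnumerable<string> classNames)
    {
        // One lookahead per class name, e.g. (?=.*\bfoo\b)(?=.*\bbar\b).
        // Each lookahead scans from the start of the input without consuming it,
        // so the class names may appear in any order.
        string lookaheads = string.Concat(
            classNames.Select(n => @"(?=.*\b" + Regex.Escape(n) + @"\b)"));

        // Singleline lets '.' cross line breaks inside the class attribute.
        return new Regex("^" + lookaheads, RegexOptions.Singleline);
    }
}
```

With the class names "foo" and "bar", the resulting regex matches "bar baz foo" but not "foo" or "foobar".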
Approach 2: Using string matching without regular expressions:
The advantage of using a custom C# method for string matching instead of a regular expression is hypothetically faster performance and lower memory usage (though Regex may be faster in some circumstances, so always profile your code first, kids!).
The method below, CheapClassListContains, provides a fast word-boundary-checking string-matching function that can be used the same way as regex.IsMatch :
    private static IEnumerable<HtmlNode> GetElementsWithClass(HtmlDocument doc, String className)
    {
        return doc
            .DocumentNode
            .Descendants()
            .Where( n => n.NodeType == HtmlNodeType.Element )
            .Where( e =>
                e.Name == "div" &&
                CheapClassListContains( e.GetAttributeValue("class", ""), className, StringComparison.Ordinal )
            );
    }

    /// <summary>Performs optionally-whitespace-padded string search without new string allocations.</summary>
    /// <remarks>A regex might also work, but constructing a new regex every time this method is called would be expensive.</remarks>
    private static Boolean CheapClassListContains(String haystack, String needle, StringComparison comparison)
    {
        if( String.Equals( haystack, needle, comparison ) ) return true;

        Int32 idx = 0;
        while( idx + needle.Length <= haystack.Length )
        {
            idx = haystack.IndexOf( needle, idx, comparison );
            if( idx == -1 ) return false;

            Int32 end = idx + needle.Length;

            // Needle must be enclosed in whitespace or be at the start/end of string
            Boolean validStart = idx == 0               || Char.IsWhiteSpace( haystack[idx - 1] );
            Boolean validEnd   = end == haystack.Length || Char.IsWhiteSpace( haystack[end] );
            if( validStart && validEnd ) return true;

            // Not a whole-word hit; resume the search one character further on.
            idx++;
        }
        return false;
    }
Approach 3: Using the CSS Selector Library:
HtmlAgilityPack is somewhat stagnant and does not support .querySelector and .querySelectorAll , but there are third-party libraries that extend HtmlAgilityPack with them: namely Fizzler and CssSelectors . Both Fizzler and CssSelectors implement QuerySelectorAll , so you can use it like this:
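    private static IEnumerable<HtmlNode> GetDivElementsWithFloatClass(HtmlDocument doc)
    {
        // QuerySelectorAll is an extension method supplied by the selector library;
        // calling it on DocumentNode (an HtmlNode) works with extensions defined on HtmlNode.
        return doc.DocumentNode.QuerySelectorAll( "div.float" );
    }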
With classes defined at runtime:
    private static IEnumerable<HtmlNode> GetDivElementsWithClasses(HtmlDocument doc, IEnumerable<String> classNames)
    {
        // Builds a compound selector such as "div.foo.bar", which requires every listed class.
        String selector = "div." + String.Join( ".", classNames );

        return doc.DocumentNode.QuerySelectorAll( selector );
    }