How can I extract the text visible on a page from its HTML source?

I tried HtmlAgilityPack with the following code, but it does not pick up the text inside HTML lists:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    HtmlNode node = doc.DocumentNode;
    return node.InnerText;

Here is the HTML it fails on:

    <as html>
    <p>This line is picked up <b>correctly</b>. List items hasn't...</p>
    <p><ul>
      <li>List Item 1</li>
      <li>List Item 2</li>
      <li>List Item 3</li>
      <li>List Item 4</li>
    </ul></p>
    </as html>
2 answers

The following code works for me:

    string StripHTML(string htmlStr)
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(htmlStr);
        var root = doc.DocumentNode;
        var sb = new System.Text.StringBuilder();

        // Visit every node in the tree and collect text from the leaves only,
        // so text is not duplicated through parent elements.
        foreach (var node in root.DescendantNodesAndSelf())
        {
            if (!node.HasChildNodes)
            {
                string text = node.InnerText;
                if (!string.IsNullOrEmpty(text))
                    sb.Append(text.Trim()).Append(' ');
            }
        }
        return sb.ToString().Trim();
    }

You need to walk the node tree and concatenate the InnerText of the individual nodes yourself.
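A minimal sketch of that tree walk, assuming the HtmlAgilityPack package is referenced. The class and method names (`VisibleText`, `Extract`) are made up for illustration, and skipping `script`/`style` content and decoding entities are my own additions, not something the question asked for. It collects text nodes only and joins them with single spaces:

```csharp
using System.Collections.Generic;
using HtmlAgilityPack;

static class VisibleText
{
    // Hypothetical helper: walks the parsed tree and joins the text nodes.
    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var parts = new List<string>();
        foreach (HtmlNode node in doc.DocumentNode.DescendantsAndSelf())
        {
            // Only text nodes carry text of their own; element nodes
            // would repeat the text of their children.
            if (node.NodeType != HtmlNodeType.Text)
                continue;

            // Assumption: text inside <script> and <style> is not "visible".
            string parent = node.ParentNode?.Name;
            if (parent == "script" || parent == "style")
                continue;

            // Decode entities such as &amp; and drop whitespace-only nodes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0)
                parts.Add(text);
        }
        return string.Join(" ", parts);
    }
}
```

Calling `VisibleText.Extract(htmlStr)` replaces the `node.InnerText` call from the question. Whether the result matches what a browser actually renders still depends on the markup, for example elements hidden by CSS are not detected.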


Source: https://habr.com/ru/post/907728/

