How can I extract the text visible on the page from its html source?

Question

How can I extract the text visible on the page from its html source?

I tried HtmlAgilityPack and the following code, but it does not grab text from html lists:

    HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
    doc.LoadHtml(htmlStr);
    HtmlNode node = doc.DocumentNode;
    return node.InnerText;

Here is the HTML it fails on:

    <as html>
    <p>This line is picked up <b>correctly</b>. List items hasn't...</p>
    <p><ul>
    <li>List Item 1</li>
    <li>List Item 2</li>
    <li>List Item 3</li>
    <li>List Item 4</li>
    </ul></p>
    </as html>

+6

html c#

Luke g Feb 05 '12 at 10:58


2 answers

Because you need to walk the node tree yourself and concatenate the InnerText of each node.

+3

Svisstack Feb 05 '12 at 23:18


Luke g · Accepted Answer · 2012-02-06T11:53:49+0000

The following piece of code works for me:

    string StripHTML(string htmlStr)
    {
        HtmlAgilityPack.HtmlDocument doc = new HtmlAgilityPack.HtmlDocument();
        doc.LoadHtml(htmlStr);
        var root = doc.DocumentNode;
        string s = "";
        // Visit every node and keep only the leaves, which carry the actual text.
        foreach (var node in root.DescendantNodesAndSelf())
        {
            if (!node.HasChildNodes)
            {
                string text = node.InnerText;
                if (!string.IsNullOrEmpty(text))
                    s += text.Trim() + " ";
            }
        }
        return s.Trim();
    }
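Since the question asks for the text that is *visible*, it may be worth noting that the loop above treats every leaf node alike, so it can also pick up the contents of script and style blocks, comments, and undecoded entities such as `&amp;`. Below is a minimal sketch of a stricter variant, not a definitive implementation. The class name `VisibleText`, the method name `Extract`, and the `HiddenParents` list are hypothetical choices made for illustration, and the list of hidden elements is an assumption that is unlikely to be exhaustive (it ignores CSS such as `display:none`, for example).

```csharp
using System;
using System.Linq;
using System.Text;
using HtmlAgilityPack;

static class VisibleText
{
    // Assumption: elements whose text a browser does not render in the page body.
    static readonly string[] HiddenParents = { "script", "style", "head", "noscript" };

    public static string Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        var sb = new StringBuilder();
        foreach (var node in doc.DocumentNode.DescendantsAndSelf())
        {
            // Keep text nodes only; this skips elements and comments.
            if (node.NodeType != HtmlNodeType.Text) continue;

            // Skip text that lives anywhere inside a non-rendered element.
            if (node.Ancestors().Any(a => HiddenParents.Contains(a.Name))) continue;

            // Decode entities (&amp; -> &) and drop whitespace-only nodes.
            string text = HtmlEntity.DeEntitize(node.InnerText).Trim();
            if (text.Length > 0) sb.Append(text).Append(' ');
        }
        return sb.ToString().Trim();
    }
}
```

It would be called the same way as `StripHTML`, for example `string text = VisibleText.Extract(htmlStr);`. Using a `StringBuilder` instead of repeated `+=` is only a design choice to avoid reallocating the string on large pages.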
