How can I extract only the text from HTML?

I need to extract all the text inside the <body> element of an HTML document. Example input:

 <html>
 <title>title</title>
 <body>
 <h1> This is a big title.</h1>
 How are doing you?
 <h3> I am fine </h3>
 <img src="abc.jpg"/>
 </body>
 </html>

The output should be:

 This is a big title. How are doing you? I am fine 

I want to use only HtmlAgilityPack for this. No regular expressions.

I know how to load an HtmlDocument and then get the contents of the body with an XPath expression like '//body'. But how do I strip the HTML tags to get the output shown above?

Thanks in advance:)

+6
4 answers

You can use the InnerText property of the body node:

 string html = @"
 <html>
 <title>title</title>
 <body>
 <h1> This is a big title.</h1>
 How are doing you?
 <h3> I am fine </h3>
 <img src=""abc.jpg""/>
 </body>
 </html>";

 HtmlDocument doc = new HtmlDocument();
 doc.LoadHtml(html);
 string text = doc.DocumentNode.SelectSingleNode("//body").InnerText;

You can then collapse runs of spaces and newlines:

 text = Regex.Replace(text, @"\s+", " ").Trim(); 

Note, however, that while this works here, InnerText simply removes the tags, so markup such as hello<br>world or hello<i>world</i> becomes helloworld. This is hard to solve in general, because how the text is displayed depends on CSS, not just on the markup.
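As a partial workaround for the <br> case only, one option is to replace each <br> element with a text node before reading InnerText. This is a minimal sketch, not a general solution: the input string and the choice of a single space as the replacement are assumptions made for illustration, and it does nothing about block-level elements or CSS-driven layout.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        var doc = new HtmlDocument();
        doc.LoadHtml("<body>hello<br>world</body>"); // illustrative input

        // SelectNodes returns null when nothing matches.
        // Copy to a list so replacing nodes does not disturb the enumeration.
        var breaks = doc.DocumentNode.SelectNodes("//br")?.ToList();
        if (breaks != null)
        {
            foreach (var br in breaks)
            {
                // Swap each <br> for a text node holding a single space.
                br.ParentNode.ReplaceChild(doc.CreateTextNode(" "), br);
            }
        }

        Console.WriteLine(doc.DocumentNode.SelectSingleNode("//body").InnerText);
    }
}
```

With the <br> replaced by a space, the two words should no longer run together in the extracted text.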

+5

How about using the XPath expression '//body//text()' to select all the text nodes?
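A minimal sketch of that idea with HtmlAgilityPack, using the input from the question. Joining the fragments with a single space and trimming each one are assumptions made here to match the requested output without a regular expression. Note that text inside <script> or <style> elements would be selected too.

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

class Program
{
    static void Main()
    {
        string html = "<html><title>title</title><body><h1> This is a big title.</h1>"
                    + " How are doing you? <h3> I am fine </h3><img src=\"abc.jpg\"/></body></html>";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var textNodes = doc.DocumentNode.SelectNodes("//body//text()");

        var parts = textNodes == null
            ? Enumerable.Empty<string>()
            : textNodes
                .Select(n => HtmlEntity.DeEntitize(n.InnerText).Trim()) // decode entities such as &amp;
                .Where(s => s.Length > 0);                              // drop whitespace-only nodes

        string text = string.Join(" ", parts);
        Console.WriteLine(text);
    }
}
```

Because each text node is handled separately, adjacent fragments such as hello<br>world end up separated by a space instead of being glued together. Whitespace inside a single text node is not collapsed by this sketch.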

+3

Normally I would recommend an HTML parser for parsing HTML. However, since you only want to remove all the HTML tags, a simple regular expression should work.

+1

You can use NUglify, which supports extracting text from HTML:

 var result = Uglify.HtmlToText("<div> <p>This is <em> a text </em></p> </div>");
 Console.WriteLine(result.Code); // prints: This is a text

Since it uses its own HTML5 parser, it should be quite robust (especially if the document contains no errors) and very fast: there are no regular expressions involved, just a pure recursive descent parser, which is faster than HtmlAgilityPack and more GC friendly.

+1

Source: https://habr.com/ru/post/887045/

