Strip HTML and CSS in C#

I generate emails in one of my solutions and must provide both HTML and plain-text versions of each message from a given HTML page.

However, I have not found a reliable way to strip HTML, JavaScript, and CSS from arbitrary HTML templates that clients may provide.

Is there a simple solution for this, perhaps a component that handles it all, or do I need to piece together a set of regular expressions? And is it possible to write a bulletproof regular expression that covers all possible tags?


+4
5 answers

Give HtmlAgilityPack a try. It has methods for extracting text from an HTML document.

You just need to do the following:

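    using HtmlAgilityPack;

    var doc = new HtmlDocument();
    doc.LoadHtml(htmlStr);            // htmlStr holds the HTML markup as a string
    var node = doc.DocumentNode;
    var textContent = node.InnerText; // text of the whole document, markup removed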
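One caveat worth checking against your version of the library: as far as I know, InnerText on the document node also returns the contents of script and style elements, which is exactly what the question wants removed. A minimal sketch of one way around that, assuming the HtmlAgilityPack NuGet package is referenced; the HtmlToText class and ToPlainText method are hypothetical names used only for illustration:

```csharp
using System;
using System.Linq;
using HtmlAgilityPack;

public static class HtmlToText
{
    // Hypothetical helper: removes script, style and comment nodes,
    // then reads the remaining text and decodes HTML entities.
    public static string ToPlainText(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null when nothing matches, so guard first.
        var unwanted = doc.DocumentNode.SelectNodes("//script|//style|//comment()");
        if (unwanted != null)
        {
            // Copy to a list so removal does not interfere with enumeration.
            foreach (var node in unwanted.ToList())
            {
                node.Remove();
            }
        }

        // Turn entities such as &amp; back into plain characters.
        return HtmlEntity.DeEntitize(doc.DocumentNode.InnerText).Trim();
    }
}
```

Calling HtmlToText.ToPlainText(htmlStr) in place of reading InnerText directly should then leave only the visible text. Whitespace between block elements is not normalized here, so some post-processing may still be needed for a readable text email.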
+8

As for a component that can remove HTML: the Html Agility Pack.

+1

You may find the Html Agility Pack useful for your situation.

+1

Take a look here: "Parse HTMLAgilityPack in InnerHTML". It includes an answer on how to do this using the Html Agility Pack.

+1

On this page, you can find a very fast algorithm for removing HTML from string input. Although there are some issues with invalid HTML, this is still a great resource. http://www.dotnetperls.com/remove-html-tags
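For illustration, a fast tag stripper of this kind is typically a single pass over the characters. The following is a minimal sketch of that idea, not a copy of the code on the linked page; the TagStripper class and StripTags method are hypothetical names:

```csharp
using System;

public static class TagStripper
{
    // Hypothetical helper: keeps only the characters that fall outside
    // of <...> pairs. A stray '>' outside a tag is dropped as well.
    public static string StripTags(string source)
    {
        var buffer = new char[source.Length];
        int length = 0;
        bool insideTag = false;

        foreach (char c in source)
        {
            if (c == '<') { insideTag = true; continue; }
            if (c == '>') { insideTag = false; continue; }
            if (!insideTag)
            {
                buffer[length++] = c;
            }
        }

        return new string(buffer, 0, length);
    }
}
```

For example, TagStripper.StripTags("<p>Hello <b>world</b></p>") returns "Hello world". This approach does not decode entities, keeps the text inside script and style elements, and misbehaves when a literal < appears in the text, which is the kind of invalid-HTML issue mentioned above.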

0

Source: https://habr.com/ru/post/1345998/

