How to handle HTML using regular expressions in C#?

For example, given this HTML:

<s2> t1 </s2> <img src='1.gif' /> <span> span1 <span/> 

I'm trying to get

 1. <s2>
 2. t1
 3. </s2>
 4. <img src='1.gif' />
 5. <span>
 6. span1
 7. <span/>

How can I do this using regular expressions in C#?

In my case, the HTML input is not well-formed XML (it is not XHTML), so I cannot use an XML parser for this.

0
5 answers

I used this regex in C# and it works. Thanks for all your answers.

 <([^<]*)>|([^<]*) 
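
For anyone who wants to try it, here is a minimal sketch of how that pattern could be applied with `Regex.Matches`. The class and method names (`TagTokenizer`, `Tokenize`) are made up for illustration, and the trimming and skipping of blank matches is my assumption about what the asker wants. The second alternative also matches whitespace and the empty string, so those have to be filtered out.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
static class TagTokenizer
{
    // The pattern from the answer: either a <...> tag or a run of non-'<' text.
    private static readonly Regex Pattern = new Regex(@"<([^<]*)>|([^<]*)");

    public static List<string> Tokenize(string html)
    {
        var tokens = new List<string>();
        foreach (Match m in Pattern.Matches(html))
        {
            // Text matches may be whitespace-only or empty; trim and skip those.
            string piece = m.Value.Trim();
            if (piece.Length > 0)
                tokens.Add(piece);
        }
        return tokens;
    }
}
```

Calling `TagTokenizer.Tokenize` on the sample HTML from the question should give the seven pieces listed there. Keep in mind that this is only splitting on angle brackets: a `>` inside text, or a `<` or `>` inside an attribute value, a comment, or a script block, will produce wrong pieces.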
0

Regular expressions are a very poor way to parse HTML. If you can guarantee that your input will be well-formed XML (i.e. XHTML), you can use XmlReader to read the elements and then print them as you like.
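
As a rough sketch of the `XmlReader` approach, assuming the input really is well-formed (the sample in the question is not, so the input below is a hypothetical cleaned-up version wrapped in a single root element). The names `XhtmlTokens` and `Read` are invented for this example, and attributes are deliberately left out to keep it short.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Xml;

// Hypothetical helper, not part of any library.
static class XhtmlTokens
{
    public static List<string> Read(string xhtml)
    {
        var tokens = new List<string>();
        using (var reader = XmlReader.Create(new StringReader(xhtml)))
        {
            while (reader.Read())
            {
                switch (reader.NodeType)
                {
                    case XmlNodeType.Element:
                        // Attributes are omitted here for brevity.
                        tokens.Add(reader.IsEmptyElement
                            ? "<" + reader.Name + "/>"
                            : "<" + reader.Name + ">");
                        break;
                    case XmlNodeType.Text:
                        tokens.Add(reader.Value.Trim());
                        break;
                    case XmlNodeType.EndElement:
                        tokens.Add("</" + reader.Name + ">");
                        break;
                }
            }
        }
        return tokens;
    }
}
```

For example, `XhtmlTokens.Read("<div><s2> t1 </s2><img src='1.gif' /><span> span1 </span></div>")` walks the elements and text in document order. On input that is not well-formed, `XmlReader` throws an `XmlException`, which is why this only helps when the XHTML guarantee holds.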

+6

This has already been answered literally dozens of times, but it bears repeating: regular expressions can only parse regular languages, which is why they are called regular expressions. HTML is not a regular language (as probably every college student has proven at least once in the last decade) and therefore cannot be parsed by regular expressions.

+4

You might want to try the Html Agility Pack (http://www.codeplex.com/htmlagilitypack). It handles even invalid HTML.
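
A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is referenced in the project. The helper names (`AgilityTokens`, `Read`) are hypothetical. Note that this walks the parsed tree rather than the raw text, so it lists elements and text nodes but not closing tags, and how the parser repairs the broken `<span/>` in the sample is up to the library.

```csharp
using System;
using System.Collections.Generic;
using HtmlAgilityPack; // external NuGet package, not part of the .NET base library

// Hypothetical helper for illustration only.
static class AgilityTokens
{
    public static List<string> Read(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html); // tolerant of malformed markup

        var tokens = new List<string>();
        // Descendants() enumerates all nodes below the document root in document order.
        foreach (HtmlNode node in doc.DocumentNode.Descendants())
        {
            if (node.NodeType == HtmlNodeType.Element)
            {
                tokens.Add("<" + node.Name + ">");
            }
            else if (node.NodeType == HtmlNodeType.Text)
            {
                string text = node.InnerText.Trim();
                if (text.Length > 0)
                    tokens.Add(text);
            }
        }
        return tokens;
    }
}
```

If the original tag text including attributes is needed, `node.OuterHtml` is available on each node, though for elements it also contains their children.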

+3

You can just use string functions. Use < and > as your delimiters for parsing.
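
One way this could be sketched with `IndexOf` and `Substring` is shown below. The names `BracketSplitter`, `Split`, and `AddIfNotBlank` are invented for the example, and dropping whitespace-only text is an assumption based on the output the question asks for.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of any library.
static class BracketSplitter
{
    public static List<string> Split(string html)
    {
        var tokens = new List<string>();
        int pos = 0;
        while (pos < html.Length)
        {
            // Everything up to the next '<' is text.
            int open = html.IndexOf('<', pos);
            if (open < 0) open = html.Length;
            AddIfNotBlank(tokens, html.Substring(pos, open - pos));
            if (open == html.Length) break;

            // Everything from '<' to the next '>' is treated as a tag.
            int close = html.IndexOf('>', open);
            if (close < 0)
            {
                // Unterminated tag: keep the remainder as a single piece.
                AddIfNotBlank(tokens, html.Substring(open));
                break;
            }
            tokens.Add(html.Substring(open, close - open + 1));
            pos = close + 1;
        }
        return tokens;
    }

    private static void AddIfNotBlank(List<string> tokens, string text)
    {
        string trimmed = text.Trim();
        if (trimmed.Length > 0)
            tokens.Add(trimmed);
    }
}
```

On the sample from the question, `BracketSplitter.Split` should return the same seven pieces. It has the same blind spots as the regex: it knows nothing about attribute quoting, comments, or scripts.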

-3

Source: https://habr.com/ru/post/1369052/
