Regex - match exactly one tag

I have a regex to extract text from an HTML tag:

<FONT FACE=\"Excelsior LT Std Bold\"(.*)>(.*)</FONT>

This works fine until a row contains several font tags. Instead of matching only

<FONT FACE="Excelsior LT Std Bold">Fett</FONT>

the result for the row

<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + <FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal

is

<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + <FONT FACE="Excelsior LT Std Italic"

How do I get only the first tag?

+3
4 answers

You need to disable greedy matching by using .*? instead of .*.

<FONT FACE=\"Excelsior LT Std Bold\"([^>]*)>(.*?)</FONT>
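The difference between the two patterns can be checked on the row from the question. This is a minimal sketch in Python (the thread does not name a language, so that choice is an assumption); the variable names are illustrative only.

```python
import re

# The row from the question.
row = ('<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + '
       '<FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal')

# Greedy pattern from the question: each (.*) grabs as much as it can.
greedy = re.compile(r'<FONT FACE="Excelsior LT Std Bold"(.*)>(.*)</FONT>')
# Pattern from this answer: [^>]* stays inside the tag, (.*?) stops early.
lazy = re.compile(r'<FONT FACE="Excelsior LT Std Bold"([^>]*)>(.*?)</FONT>')

greedy_match = greedy.search(row)
lazy_match = lazy.search(row)

print(greedy_match.group(0))  # runs on to the last </FONT> in the row
print(lazy_match.group(0))    # only the first FONT element
print(lazy_match.group(2))    # Fett
```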

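Note that this can still go wrong, for example if an attribute such as BadAttribute="<FooBar>" follows FACE inside the <FONT> tag, or if the text itself contains a nested </FONT>. As Tomalak notes, regular expressions are not suited to parsing XML, HTML and similar formats.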
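Two inputs that trip up the non-greedy pattern can be sketched as follows. Both input strings are hypothetical and were made up for illustration; only the BadAttribute="<FooBar>" fragment comes from the answer itself.

```python
import re

pattern = re.compile(r'<FONT FACE="Excelsior LT Std Bold"([^>]*)>(.*?)</FONT>')

# Hypothetical input: an attribute value that itself contains ">".
tricky = '<FONT FACE="Excelsior LT Std Bold" BadAttribute="<FooBar>">Fett</FONT>'
tricky_text = pattern.search(tricky).group(2)
print(repr(tricky_text))  # '">Fett' rather than 'Fett'

# Hypothetical input: a FONT element nested inside the one we want.
nested = ('<FONT FACE="Excelsior LT Std Bold">Fett '
          '<FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> Ende</FONT>')
nested_text = pattern.search(nested).group(2)
print(repr(nested_text))  # cut off at the inner </FONT>
```

In both cases the pattern still matches, so the failure is silent: the captured text is simply wrong.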

+9

Try this:

<FONT FACE=\"Excelsior LT Std Bold\"[^>]*>(.*?)</FONT>
                                    ^^^^^  ^^^
                                      |     |
     match any character except ">" --+     +--------+
                                                     |
   match anything, but only up to the next </FONT> --+

The usual warning about HTML applies: regular expressions are not a parser.
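When the input is real HTML rather than a known, fixed format, a parser avoids the pitfalls of the regex approach. The following is a minimal sketch using Python's standard html.parser module; the class FirstFontText and the function first_font_text are hypothetical names invented for this example, not part of any answer in the thread.

```python
from html.parser import HTMLParser


class FirstFontText(HTMLParser):
    """Collect the text of the first <FONT> element with a given FACE."""

    def __init__(self, face):
        super().__init__()
        self.face = face
        self.depth = 0      # nesting depth inside the wanted element
        self.done = False   # True once the wanted element has closed
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Tag and attribute names arrive lower-cased from HTMLParser.
        if self.done or tag != "font":
            return
        if self.depth > 0:
            self.depth += 1              # a font nested inside the target
        elif dict(attrs).get("face") == self.face:
            self.depth = 1               # entering the target element

    def handle_endtag(self, tag):
        if tag == "font" and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.parts.append(data)


def first_font_text(html, face):
    parser = FirstFontText(face)
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


row = ('<FONT FACE="Excelsior LT Std Bold">Fett</FONT> + <U>Unterstrichen</U> + '
       '<FONT FACE="Excelsior LT Std Italic">Kursiv</FONT> und Normal')
print(first_font_text(row, "Excelsior LT Std Bold"))  # Fett
```

Unlike the regex, this keeps track of nesting, so text inside an inner FONT element is included and the outer element ends at its own closing tag.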

+3

Use non-greedy matching by adding '?':

<FONT FACE=\"Excelsior LT Std Bold\"(.*?)>(.*?)</FONT>
+2
<FONT[^>]*Excelsior LT Std Bold[^>]*></FONT>

See also Phil Haack's post on matching HTML with regular expressions.

Here is how I use this kind of expression in C#. It was used to remove certain CSS and JS files from the HTTP response.

using System.Text.RegularExpressions;

// {0} is replaced with a fragment of the file name to match.
const string CSSFormat = "<link[^>]*{0}[^>]*css[^>]*>";
const string JSFormat = "<script[^>]*{0}[^>]*js[^>]*></script>";

// Compiled once and reused for every response.
static readonly Regex OverrideCss = new Regex(string.Format(CSSFormat, "override-"), RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
static readonly Regex OverrideIconsJs = new Regex(string.Format(JSFormat, "overrideicons"), RegexOptions.IgnoreCase | RegexOptions.Compiled | RegexOptions.Singleline);
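For readers outside .NET, the same template idea can be sketched in Python. This is an illustration only, not the author's code: the function strip_overrides and the sample page are hypothetical, re.DOTALL stands in for RegexOptions.Singleline, and re.escape is added as a precaution in case the file-name fragment contains regex metacharacters.

```python
import re

# Same templates as the C# snippet; {0} is a fragment of the file name.
CSS_FORMAT = r"<link[^>]*{0}[^>]*css[^>]*>"
JS_FORMAT = r"<script[^>]*{0}[^>]*js[^>]*></script>"

FLAGS = re.IGNORECASE | re.DOTALL
override_css = re.compile(CSS_FORMAT.format(re.escape("override-")), FLAGS)
override_icons_js = re.compile(JS_FORMAT.format(re.escape("overrideicons")), FLAGS)


def strip_overrides(html):
    """Remove the matching <link> and <script> tags from an HTML string."""
    return override_icons_js.sub("", override_css.sub("", html))


# Hypothetical response body.
page = ('<link rel="stylesheet" href="/css/override-theme.css">'
        '<script src="/js/overrideicons.js"></script><p>Body</p>')
print(strip_overrides(page))  # <p>Body</p>
```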
0

Source: https://habr.com/ru/post/1706971/