In regex, why is "((.|\s)*?)" different from "\s*.*"?

I'm not a complete newbie, but I still don't understand everything about regular expressions. I was trying to use a regex to strip out <p> tags, and my first attempt

<p\s*.*> 

was so greedy that it matched the whole line

 <p someAttributes='example'>SomeText</p> 

I got it to work with

 ((.|\s)*?) 

It seems like this should be just as greedy. Can someone help me understand why it isn't?

I'm trying to keep this question as general as possible, but I did this with ColdFusion's reReplaceNoCase, in case that matters.

+6
3 answers

The key difference is the *? part, which creates a reluctant quantifier, so it tries to match as little as possible. The standard quantifier * is a greedy quantifier and tries to match as much as possible.
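
A minimal sketch of that difference, written in Python rather than ColdFusion. It assumes the working pattern from the question was wrapped as <p((.|\s)*?)>, which the question does not show in full; the variable names are only illustrative.

```python
import re

line = "<p someAttributes='example'>SomeText</p>"

# Greedy: .* runs to the end of the line, then backtracks only as far as
# the LAST '>' it can find, which is the one closing </p>.
greedy = re.match(r"<p\s*.*>", line)

# Reluctant: (.|\s)*? grows one character at a time and stops at the
# FIRST '>' that lets the rest of the pattern match.
# Assumption: the question's pattern was wrapped in <p ... > like this.
reluctant = re.match(r"<p((.|\s)*?)>", line)

print(greedy.group(0))     # the whole line
print(reluctant.group(0))  # only the opening tag
```

Both patterns accept the same characters between <p and >; only how far the quantifier is willing to go differs.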

See Greedy vs. Reluctant vs. Possessive Quantifiers

As Seth Robertson pointed out, you may want to use a regular expression that does not depend on greedy/reluctant behavior. In fact, you can write a possessive regex for better performance:

 <p\s*+[^>]*+> 

Here \s*+ matches any number of whitespace characters, and [^>]*+ matches any number of characters except > . Neither quantifier backtracks when the match fails, which improves running time in the failure case and, in some regular expression implementations, in the success case as well (because internal backtracking data can be omitted).

Note that if there are other tags starting with <p (I haven't written HTML by hand for a long time), this will match them as well. If you do not want that, use a regex like:

 <p(\s++[^>]*+)?> 

This makes the entire section between <p and > optional.
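
A rough illustration of the two patterns, again in Python. Python's re module only accepts possessive quantifiers from version 3.11 on, so this sketch swaps in the ordinary greedy forms; for these particular patterns they accept the same strings, and only the backtracking cost differs. The sample tags are made up for the example.

```python
import re

# Greedy stand-ins for <p\s*+[^>]*+> and <p(\s++[^>]*+)?>
# (assumption: possessive syntax may not be available in the Python in use).
loose = re.compile(r"<p\s*[^>]*>")
strict = re.compile(r"<p(\s+[^>]*)?>")

samples = ["<p>", "<p class='x'>", "<pre>"]  # hypothetical sample tags

# The loose pattern lets [^>]* swallow "re", so <pre> slips through.
loose_hits = [s for s in samples if loose.fullmatch(s)]

# The strict pattern requires whitespace right after <p before anything
# else may appear, so <pre> is rejected while a bare <p> still matches.
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)
print(strict_hits)
```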

+12

Well, either regular expression will match just about anything, so the question is somewhat moot. Using a non-greedy quantifier will probably get you closer to what you want, but it can still produce very unexpected results.

While you are not supposed to match HTML/XML with regular expressions, you probably want something like:

 <p\s*([^>]*)> 

to put any attributes of the p tag in $1.
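
As a sketch of how that capture behaves, here it is in Python, where $1 corresponds to group(1). The IGNORECASE flag stands in for the case-insensitive matching of reReplaceNoCase; the rest of the ColdFusion call is not reproduced here.

```python
import re

line = "<p someAttributes='example'>SomeText</p>"
pattern = re.compile(r"<p\s*([^>]*)>", re.IGNORECASE)

# \s* consumes the space after <p, so group 1 holds only the attributes.
match = pattern.search(line)
attrs = match.group(1)

# Replacing the match with an empty string strips the opening tag only;
# </p> does not start with "<p", so it is left in place.
stripped = pattern.sub("", line)

print(attrs)
print(stripped)
```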

+2
 <p\s*.*> 

This searches for "<p", then zero or more whitespace characters, then zero or more of any character, then '>'. The "any character" class includes '>', so the regex matches the entire string.

0

Source: https://habr.com/ru/post/889899/

