In Regex, why is "((.|\s)*?)" different from "\s*.*"

Question

I'm not a complete newbie, but I still don't understand everything about regular expressions. I tried using a regex to strip out <p> tags, and my first attempt

<p\s*.*>

was so greedy that it matched the whole line

 <p someAttributes='example'>SomeText</p>

I got it to work with

 ((.|\s)*?)

It seems like it should be just as greedy. Can someone help me understand why it is not?

I'm trying to keep this question as language-agnostic as possible, but I'm doing this with ColdFusion's reReplaceNoCase, if that matters much.

+6

regex

invertedSpear Jun 06 '11 at 20:04

3 answers

Well, either regular expression will match, so the question is moot. Using a non-greedy match will probably get you closer to what you want, but it can still give very unexpected results.

While you should not parse HTML/XML with regular expressions, you probably want something like:

 <p\s*([^>]*)>

This puts any attributes of the p tag in $1.
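A minimal sketch of that suggestion, written in Python rather than ColdFusion so it can be run standalone. The sample line is the one from the question; the IGNORECASE flag is only an assumption meant to mirror reReplaceNoCase, and the variable names are made up for illustration.

```python
import re

# Sample line taken from the question.
line = "<p someAttributes='example'>SomeText</p>"

# [^>]* cannot cross a '>', so the match ends at the first closing bracket.
# IGNORECASE is an assumption, loosely mirroring reReplaceNoCase.
pattern = re.compile(r"<p\s*([^>]*)>", re.IGNORECASE)

match = pattern.search(line)
opening_tag = match.group(0)   # the whole opening tag
attributes = match.group(1)    # what the answer calls $1

# Removing the opening tag, roughly what a replace call would do.
stripped = pattern.sub("", line)

print(opening_tag)
print(attributes)
print(stripped)
```

Because the character class excludes '>', the result does not depend on whether the quantifier is greedy or reluctant.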

+2

Seth Robertson Jun 06 '11 at 20:08

 <p\s*.*>

This searches for "<p", then zero or more whitespace characters, then zero or more of any character, then ">". The "any character" class includes ">", so the regex matches the entire string.
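To see this on the line from the question, here is a small Python sketch (Python is just a stand-in here, not the engine the asker used):

```python
import re

line = "<p someAttributes='example'>SomeText</p>"

greedy = re.search(r"<p\s*.*>", line)

# .* first consumes everything up to the end of the line, then backtracks
# only until a '>' can match, which is the final character of the line.
greedy_text = greedy.group(0)

print(greedy_text)
print(greedy.span())
```

The match runs from the first character to the last '>' on the line, not to the first one.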

0

Alessandro Pezzato Jun 06 '11 at 20:12

Christian Semrau · Accepted Answer · 2011-06-06T20:07:12+0000

The key difference is the *? part, which creates a reluctant quantifier, so it tries to match as little as possible. The standard quantifier * is greedy and tries to match as much as possible.
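A rough Python comparison of the two behaviours. It assumes the working pattern from the question was wrapped as <p((.|\s)*?)>, which the question does not show in full. The multi-line sample string is invented for illustration.

```python
import re

line = "<p someAttributes='example'>SomeText</p>"

greedy = re.search(r"<p\s*.*>", line).group(0)
reluctant = re.search(r"<p\s*.*?>", line).group(0)
# Assumed full form of the pattern the asker got working.
alternation = re.search(r"<p((.|\s)*?)>", line).group(0)

# A tag broken across two lines (made-up sample): '.' does not match
# '\n' by default in Python, but the (.|\s) alternation does.
multiline = "<p class='a'\n   id='b'>SomeText</p>"
dot_only = re.search(r"<p.*?>", multiline)           # finds nothing
dot_or_space = re.search(r"<p((.|\s)*?)>", multiline).group(0)

print(greedy)
print(reluctant)
print(alternation)
print(dot_only)
print(repr(dot_or_space))
```

So the (.|\s) group is not what stops the match early; the ? after * is. The alternation only changes whether line breaks can be matched.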

See Greedy vs. Reluctant vs. Possessive Quantifiers

As Seth Robertson pointed out, you can use a regular expression that does not depend on greedy/reluctant behavior. In fact, you can write a possessive regex for better performance:

 <p\s*+[^>]*+>

Here \s*+ matches any number of whitespace characters, and [^>]*+ matches any number of characters except > . Neither quantifier backtracks when the rest of the pattern fails, which improves execution time when there is no match and, for some regular expression implementations, also when there is a match (since internal backtracking data can be omitted).
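A hedged Python sketch of the possessive form. Python's re module only accepts the *+ syntax from version 3.11 on, so the sketch guards on the version; the unclosed sample string is invented, and the sketch checks the results, not the timing.

```python
import re
import sys

line = "<p someAttributes='example'>SomeText</p>"
unclosed = "<p someAttributes='example'"  # made-up input with no '>' at all

# Greedy form: works on any Python version, may backtrack on a mismatch.
greedy = re.compile(r"<p\s*[^>]*>")

# Possessive form from the answer: needs Python 3.11 or later.
if sys.version_info >= (3, 11):
    possessive = re.compile(r"<p\s*+[^>]*+>")
else:
    possessive = None  # older versions of re reject the *+ syntax

greedy_tag = greedy.search(line).group(0)
greedy_miss = greedy.search(unclosed)      # None: there is no '>'

if possessive is not None:
    possessive_tag = possessive.search(line).group(0)
    possessive_miss = possessive.search(unclosed)
    print(possessive_tag, possessive_miss)

print(greedy_tag, greedy_miss)
```

Both forms find the same tag and both fail on the unclosed input; the possessive one just gives up without retrying shorter matches.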

Please note that if there are other tags starting with <p (I have not written HTML directly for a long time), you will match them as well. If you do not want this, use a regex like:

 <p(\s++[^>]*+)?>

This makes the entire section between <p and > optional, and requires it to start with whitespace when it is present.
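A short Python check of that difference. To keep it runnable on older Python versions it uses the ordinary greedy forms of both patterns, which accept the same strings as the possessive ones. The sample tags, including <pre> and <param>, are my own picks and not from the answer.

```python
import re

loose = re.compile(r"<p\s*[^>]*>")       # earlier pattern, greedy form
strict = re.compile(r"<p(\s+[^>]*)?>")   # greedy form of the pattern above

# Made-up sample tags for illustration.
samples = ["<p>", "<p class='x'>", "<pre>", "<param name='n'>"]

loose_hits = [s for s in samples if loose.fullmatch(s)]
strict_hits = [s for s in samples if strict.fullmatch(s)]

print(loose_hits)
print(strict_hits)
```

The loose pattern accepts all four samples, while the strict one keeps only the two real p tags.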
