Why is a minimal (non-greedy) match affected by the end-of-string character '$'?

Question

EDIT: Removed the original example because it provoked ancillary answers. Also fixed the title.

The question is why the presence of "$" in the regular expression influences the greediness of the expression:

Here is a simpler example:

    >>> import re
    >>> str = "baaaaaaaa"
    >>> m = re.search(r"a+$", str)
    >>> m.group()
    'aaaaaaaa'
    >>> m = re.search(r"a+?$", str)
    >>> m.group()
    'aaaaaaaa'

The "?" seems to do nothing. Note that when the "$" is removed, the "?" is respected:

    >>> m = re.search(r"a+?", str)
    >>> m.group()
    'a'

EDIT: In other words, "a+?$" is matching ALL of the a's instead of just the last one, which is not what I expected. Here is the description of the regex "+?" from the Python docs: "Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched."

This does not seem to be the case in this example: the string "a" matches the regex "a+?$", so why isn't the match for the same regex on the string "baaaaaaa" just a single a (the rightmost one)?

+6

python regex non-greedy

krumpelstiltskin May 03 '11 at 23:44

6 answers

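The non-greedy modifier only affects where the match stops, never where it starts. If you want to start the match as late as possible, you will have to add a greedy .* to the beginning of your pattern and capture the part you want in a group (a lazy .+? prefix would hand control back as early as possible, which has the opposite effect).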

Without the $, your pattern is allowed to be less greedy and stop sooner, because it does not have to match all the way to the end of the string.
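A quick way to check this is to compare match spans. The following is a minimal sketch using only the standard `re` module; the variable names are illustrative, and the greedy `.*` prefix with a capturing group is just one possible way to push the captured part as late as possible, not the only option.

```python
import re

text = "baaaaaaaa"

# All three searches start at index 1 (the first 'a'); only the end differs.
for pattern in (r"a+$", r"a+?$", r"a+?"):
    m = re.search(pattern, text)
    print(pattern, m.span())

# A greedy prefix consumes as much as it can, so the lazy group
# is left with only the final 'a'.
late = re.search(r".*(a+?)$", text)
print(late.span(1))
```

The start index is the same in the first three cases; only the prefixed pattern moves the captured group to the right.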

EDIT:

More details ... In this case:

 re.search(r"a+?$", "baaaaaaaa")

The regex engine will skip everything up to the first "a", because that is how re.search works. It will match the first a and "wants" to return a match, except that it does not match the pattern yet, because it must also reach a match for the $. So it just keeps eating the a's one at a time and checking for $. If it were greedy, it would not check for $ after each a, but only after it could not match any more a's.

But in this case:

 re.search(r"a+?", "baaaaaaaa")

The regex engine checks whether it has a complete match after matching the first "a" (because it is non-greedy) and succeeds, because in this case there is no $.
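The difference in when `$` gets checked can be sketched with a toy model. This is not how CPython's `re` engine is implemented; the two hypothetical functions below only mimic the order of attempts for `a+?$` and `a+$` at a single start position, and they treat `$` as "end of string" only (ignoring the trailing-newline case).

```python
def lazy_plus_then_end(text, start):
    """Toy model of a+?$ tried at one start position.

    Returns (end_index_or_None, number_of_end_checks).
    """
    checks = 0
    pos = start
    while pos < len(text) and text[pos] == "a":
        pos += 1                  # take one more 'a' (the minimum first)
        checks += 1               # then immediately test for '$'
        if pos == len(text):
            return pos, checks
    return None, checks


def greedy_plus_then_end(text, start):
    """Toy model of a+$ tried at one start position."""
    pos = start
    while pos < len(text) and text[pos] == "a":
        pos += 1                  # take every available 'a' first
    if pos == start:
        return None, 0            # a+ needs at least one 'a'
    checks = 0
    end = pos
    while end > start:            # then test '$', giving back one 'a' at a time
        checks += 1
        if end == len(text):
            return end, checks
        end -= 1
    return None, checks


print(lazy_plus_then_end("baaaaaaaa", 1))    # (9, 8)
print(greedy_plus_then_end("baaaaaaaa", 1))  # (9, 1)
```

Both models end at the same index; the lazy one simply tests for the end of the string after every character instead of once.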

+4

Mu mind May 04, '11 at 0:50

The presence of $ in a regular expression does not affect the greediness of the expression. It merely adds another condition that must be met for the overall match to succeed.

Both a+ and a+? are required to consume the first a they find. If that a is followed by more a's, a+ goes ahead and consumes them too, while a+? is content with just the one. If there were anything more to the regex, a+ would be willing to settle for fewer a's, and a+? would consume more, if that is what it took to achieve a match.

With a+$ and a+?$, you have added another condition: match at least one a followed by the end of the string. a+ still consumes all of the a's initially, then it hands off to the anchor ($). That succeeds on the first try, so a+ is not required to give back any of its a's.

On the other hand, a+? initially consumes just one a before handing off to $. That fails, so control returns to a+?, which consumes another a and hands off again. And so it goes, until a+? consumes the last a and $ finally succeeds. So yes, a+?$ matches the same number of a's as a+$, but it does so reluctantly, not greedily.
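Both directions of this adjustment can be observed with capturing groups. The snippet below is a small sketch with the standard `re` module; the sample strings and variable names are made up for illustration.

```python
import re

# Greedy a+ first takes all three 'a's, then gives one back so "ab" can match.
gave_back = re.search(r"(a+)ab", "baaab")
print(gave_back.group(1))

# Lazy a+? first takes one 'a', then keeps extending until "b" can match.
took_more = re.search(r"(a+?)b", "baaab")
print(took_more.group(1))

# Followed by '$', both quantifiers are forced to the same final result.
greedy_end = re.search(r"(a+)$", "baaa")
lazy_end = re.search(r"(a+?)$", "baaa")
print(greedy_end.group(1) == lazy_end.group(1))
```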

As for the leftmost-longest rule that was mentioned elsewhere, it has never applied to Perl-derived regex flavors such as Python's. Even without reluctant quantifiers, they could always return a less-than-maximal match because of ordered alternation. I think John had the right idea: Perl-derived (or regex-directed) flavors should be called eager, not greedy.
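Ordered alternation is easy to observe in Python. This is a minimal illustration with the standard `re` module, using a made-up two-character input:

```python
import re

# The first alternative that lets the overall match succeed wins,
# even though a longer match ("ab") is available at the same position.
first_wins = re.search(r"a|ab", "ab")
print(first_wins.group())   # a

# Reordering the alternatives changes the result.
reordered = re.search(r"ab|a", "ab")
print(reordered.group())    # ab
```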

I believe the leftmost-longest rule applies only to POSIX NFA regexes, which use NFA engines under the hood but are required to return the same results a DFA (text-directed) regex would.

+3

Alan moore May 05 '11 at 10:39

There are two issues going on here. You used group() without specifying a group, and I can tell you are getting confused between the behavior of regular expressions with an explicitly parenthesized group and without one. The behavior without parentheses that you are observing is just a shortcut that Python provides, and you need to read the documentation on group() to understand it fully.

    >>> import re
    >>> string = "baaa"
    >>>
    >>> # Here you're searching for one or more `a`s until the end of the line.
    >>> pattern = re.search(r"a+$", string)
    >>> pattern.group()
    'aaa'
    >>>
    >>> # This means the same thing as above, since the presence of the `$`
    >>> # cancels out any meaning that the `?` might have.
    >>> pattern = re.search(r"a+?$", string)
    >>> pattern.group()
    'aaa'
    >>>
    >>> # Here you remove the `$`, so it matches the least amount of `a` it can.
    >>> pattern = re.search(r"a+?", string)
    >>> pattern.group()
    'a'

The bottom line is that the pattern a+? matches one a, period. However, a+?$ matches a's until the end of the line. Note that without explicit grouping, you will have a hard time getting the ? to mean anything at all. In general, it is better to be explicit about what you are grouping with parentheses anyway. Let me give you an example with explicit groups.

    >>> # This is close to the example pattern with `a+?$` and therefore `a+$`.
    >>> # It matches `a`s until the end of the line. Again the `?` can't do anything.
    >>> pattern = re.search(r"(a+?)$", string)
    >>> pattern.group(1)
    'aaa'
    >>>
    >>> # In order to get the `?` to work, you need something else in your pattern
    >>> # and outside your group that can be matched that will allow the selection
    >>> # of `a`s to be lazy.
    >>> # In this case, the `.*` is greedy and will gobble up
    >>> # everything that the lazy `a+?` doesn't want to.
    >>> pattern = re.search(r"(a+?).*$", string)
    >>> pattern.group(1)
    'a'

Edit: Deleted text related to old versions of the question.

+1

arussell84 May 03 '11 at 23:46

The answer to the original question:

Why does the first search() span multiple "/" rather than taking the shortest match?

A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding. In your example, the last subpattern is $, so the previous ones need to stretch out to the end of the string.

Answer to the revised question:

A non-greedy subpattern will take the shortest match consistent with the whole pattern succeeding.

Another way of looking at it: a non-greedy subpattern will initially match the shortest possible match. However, if this causes the whole pattern to fail, it will be retried with an extra character. This process continues until the subpattern fails (causing the whole pattern to fail) or the whole pattern matches.
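How far the lazy subpattern stretches depends entirely on what follows it. The helper below is a hypothetical illustration (the name `lazy_capture` and the sample string are not from the question); it reports what `(a+?)` ends up capturing for a few different tails.

```python
import re

text = "aaaab"


def lazy_capture(tail):
    """Return what (a+?) captures when followed by `tail`, or None if no match."""
    m = re.search(r"(a+?)" + tail, text)
    return m.group(1) if m else None


print(lazy_capture(""))    # nothing follows: the shortest possible match
print(lazy_capture("b"))   # must stretch until a 'b' can follow
print(lazy_capture("ab"))  # must stretch until "ab" can follow
print(lazy_capture("c"))   # the subpattern runs out of 'a's: the whole pattern fails
```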

+1

John machin May 04 '11 at 12:29

Unless your question is leaving out some important information, you do not need, and should not use, a regular expression for this task.

    >>> import os
    >>> p = "/we/shant/see/this/butshouldseethis"
    >>> os.path.basename(p)
    'butshouldseethis'

0

jonesy May 03 '11 at 23:49

Fred nurk · Accepted Answer · 2011-05-04T00:37:03+0000

Matches are "ordered" by "leftmost, then longest"; however, "longest" is the term used before non-greedy was allowed, and instead means something like "preferred number of repetitions for each atom". Being leftmost is more important than the number of repetitions. Thus, "a+?$" will not match the last A in "baaaaa", because matching at the first A starts earlier in the string.
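The priority of "leftmost" can be checked by asking at which positions the pattern is able to match at all. This is a small sketch using `Pattern.match` with an explicit start position; the variable names are illustrative.

```python
import re

text = "baaaaa"
pattern = re.compile(r"a+?$")

# Positions at which the pattern can match if the attempt starts there.
possible_starts = [i for i in range(len(text)) if pattern.match(text, i)]
print(possible_starts)

# search() scans left to right and reports the first position that works.
found = pattern.search(text)
print(found.start(), found.group())
```

A match starting at the last "a" is possible, but search() never gets that far, because an earlier start position already succeeds.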

(Answer changed after the OP's clarification in the comments. See the edit history for the previous text.)
