Why is this regex expression not working?

$ echo '!abcae20' | grep -o -P '(?=.*\d)\w{4,}' 

It prints nothing.

But the following works:

 $ echo '!abcae20' | grep -o -P '.*?(?=.*\d)\w{4,}'
 !abcae20

Can someone give me an explanation?

+6
5 answers

In your first expression, the lookahead assertion matches your whole input, because the .* inside it is greedy.

Running your regex in debug mode displays the following.

 Matching REx "(?=.*\d)\w{4,}" against "!abcae20"
    0 <> <!abcae20>    |  1: IFMATCH[0](8)
    0 <> <!abcae20>    |  3: STAR(5)
                            REG_ANY can match 8 times out of 2147483647...
    8 <!abcae20> <>    |  5: DIGIT(6)
                            failed...
    7 <!abcae2> <0>    |  5: DIGIT(6)
    8 <!abcae20> <>    |  6: SUCCEED(0)
                            subpattern success...
    0 <> <!abcae20>    |  8: CURLY {4,32767}(11)
                            ALNUMU can match 0 times out of 2147483647...
                            failed...
 Match failed
  • An explanation of what caused the failure:

    • <! abcae20 >

       The greedy quantifier first matches as much as possible, so the .* here matches the entire string.
    • < !abcae20 >

       The engine then tries to match a digit, but there are no characters left to match.
    • < !abcae2 0 >

       So it backtracks: the .* gives up one character, leaving the 0 at the end, which \d then matches. The lookahead succeeds.
    • < !abcae20 >

       A lookahead consumes nothing, so the engine is back at position 0 and tries \w{4,} there.
    • <! abcae20 >

       The first character is ! , which is not a word character, so \w{4,} can match 0 times and the attempt fails. The trace shows no attempt at a later position, so the whole match fails.
  • Regular expression explanation:

     (?=        look ahead to see if there is:
       .*       any character except \n (0 or more times)
       \d       digits (0-9)
     )          end of look-ahead
     \w{4,}     word characters (a-z, A-Z, 0-9, _) (at least 4 times)
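The attempt at position 0 can be reproduced outside grep. This is a minimal sketch in Python, whose re module is a different engine from PCRE, so it only illustrates the matching steps and says nothing about what grep -P itself does. The helper name try_at is made up for this illustration.

```python
import re

SUBJECT = "!abcae20"
PATTERN = re.compile(r"(?=.*\d)\w{4,}")

def try_at(pos):
    """Attempt the pattern at exactly one start position (hypothetical helper)."""
    m = PATTERN.match(SUBJECT, pos)
    return m.group() if m else None

# Position 0: the lookahead succeeds, but \w{4,} cannot match '!'.
print(try_at(0))
# Position 1: the lookahead succeeds again and \w{4,} matches the rest.
print(try_at(1))
```

In this sketch the attempt at position 0 fails exactly as in the trace, while an attempt at position 1 would succeed. The debug output above stops after the first attempt.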

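Your second expression first tries the non-greedy .*? with zero characters and fails at ! for the same reason. It then lets .*? consume the ! , after which the lookahead succeeds at position 1 (its .* covers abcae2 and \d matches the 0 ), and \w{4,} matches abcae20 . The overall match is therefore the full line.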

  • Regular expression explanation:

     .*?        any character except \n (0 or more times, as few as possible)
     (?=        look ahead to see if there is:
       .*       any character except \n (0 or more times)
       \d       digits (0-9)
     )          end of look-ahead
     \w{4,}     word characters (a-z, A-Z, 0-9, _) (at least 4 times)
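To check which part of the second expression consumes which characters, here is a small Python sketch. The two capture groups are not in the original pattern; they are added only to expose what .*? and \w{4,} matched. Python's re engine is assumed to behave like PCRE for this simple pattern.

```python
import re

# Capture groups added for illustration only; the lookahead is unchanged.
m = re.search(r"(.*?)(?=.*\d)(\w{4,})", "!abcae20")
lazy_part, word_part = m.group(1), m.group(2)

print(repr(lazy_part))   # what the non-greedy .*? consumed
print(repr(word_part))   # what \w{4,} consumed
print(repr(m.group(0)))  # the overall match, which grep -o would print
```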
+2

This works:

 echo '!abcae20' | grep -o -P '.*?(?=.*\d)\w{4,}'

Because .*? matches ! , (?=.*\d) matches abcae20 , and \w{4,} matches abcae20.

In this one:

 echo '!abcae20' | grep -o -P '(?=.*\d)\w{4,}' 

The lookahead matches !abcae20 , being greedy. However, \w{4,} cannot match ! , so it fails.

Here is Perl's regex debug output for the failing pattern:

 Matching REx "(?=.*\d)\w{4,}" against "!abcae20"
    0 <> <!abcae20>    |  1:IFMATCH[0](8)
    0 <> <!abcae20>    |  3: STAR(5)
                            REG_ANY can match 8 times out of 2147483647...
    8 <!abcae20> <>    |  5: DIGIT(6)
                            failed...
    7 <!abcae2> <0>    |  5: DIGIT(6)
    8 <!abcae20> <>    |  6: SUCCEED(0)
                            subpattern success...
    0 <> <!abcae20>    |  8:CURLY {4,32767}(11)
                            ALNUM can match 0 times out of 2147483647...
                            failed...
+2
 $ echo '!abcae20' | grep -o -P '(?=.*\d)\w{4,}' 

In this regular expression, the lookahead (?=.*\d) succeeds at the very beginning of the line (its .* covers !abcae2 and \d matches the 0 ), so the engine tries to match \w{4,} from the beginning of the line. But the first character is ! , which does not match \w , so the full match fails.

The following regex should make things clearer:

 $ echo '!abcae20' | grep -o -P '(?=\w*\d)\w{4,}'
 abcae20

Here the lookahead cannot succeed at ! , because neither \w* nor \d can match it. It first succeeds at a , where it covers abcae20 , so the match starts with a and the final match is abcae20 .
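The difference between the two lookaheads can be checked position by position. This is a sketch in Python (a different engine from PCRE, so it illustrates the pattern semantics only). The helper lookahead_ok is a made-up name.

```python
import re

SUBJECT = "!abcae20"
LOOKAHEAD = re.compile(r"(?=\w*\d)")

def lookahead_ok(pos):
    """True if the lookahead alone succeeds at this position (hypothetical helper)."""
    return LOOKAHEAD.match(SUBJECT, pos) is not None

print(lookahead_ok(0))  # at '!': \w* matches nothing and \d cannot match '!'
print(lookahead_ok(1))  # at 'a': \w* covers 'abcae2' and \d matches '0'

full = re.search(r"(?=\w*\d)\w{4,}", SUBJECT).group()
print(full)
```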


 $ echo '!abcae20' | grep -o -P '.*?(?=.*\d)\w{4,}'
 !abcae20

In the regex above, you let the leading .*? consume the ! first, so the whole line matches.

+1

According to man pcrepattern :

If a pattern starts with .* or .{0,} and the PCRE_DOTALL option (equivalent to Perl's /s) is set, thus allowing the dot to match newlines, the pattern is implicitly anchored, because whatever follows will be tried against every character position in the subject string, so there is no point in retrying the overall match at any position after the first.

As the manpage mentions, this optimization cannot be used if the .* is inside a parenthesized group that is the target of a backreference, since in that case there may be a point in retrying the overall match at later positions. The same argument would suggest that the optimization is not valid for zero-length assertions, as the pattern in the OP shows.

It is not clear from the manpage whether a .* inside a lookahead triggers the implicit anchoring, but it is certainly possible (although it would be a bug, imho). For some reason, adding (?-s) , which I think should unset PCRE_DOTALL , does not change the behavior. Changing the .* to something else does, however. In particular, changing it to [^\d]* gives the expected result:

 $ echo '!abcae20' | grep -P -o '(?=[^\d]*\d)\w{4,}'
 abcae20

It is at least interesting that there are cases where the lookahead assertion appears to work without creating an implicit anchor, which may cast some doubt on the above analysis. But it may just be an interaction with some other optimization. In particular,

 $ echo '!abcae20' | grep -P -o '(?=.*\d)a'
 a
 $

obviously could not match if the pattern were anchored. On the other hand, changing a to [ab] , which should apparently not affect the match, does:

 $ echo '!abcae20' | grep -P -o '(?=.*\d)[ab]'
 $

(Many thanks to @perreal for an interesting discussion of this issue.)

Some of the observations that initially made me think this might be a bug were:

 $ echo '!abcde20' | grep -P -o '(?=.*\d)\w*'
 abcde20
 $ echo '!abcde20' | grep -P -o '(?=.*\d)\w+'
 $ echo '!abcde20' | grep -P -o '(?=.*\d)\w'
 $ echo '!abcde20' | grep -P -o '(?=.*\d)\w?'
 a
 b
 c
 d
 e
 2
 0

This all looks illogical, but it actually makes sense if the pattern is implicitly anchored. In the first and last cases ( \w* and \w? ), the pattern matches the empty string at the beginning of the input. grep -o then retries the pattern at the next character position, where it succeeds. In the other two cases ( \w+ and \w ), the anchored pattern fails, so grep does not retry it.
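The first step of that reasoning, what each pattern does when it may only be tried at position 0, can be sketched in Python. Here match() is used to imitate an implicitly anchored pattern. That is an assumption made for illustration, not a description of PCRE internals, and the retry that grep -o performs afterwards is not modeled. The name anchored_result is hypothetical.

```python
import re

SUBJECT = "!abcde20"
TAILS = [r"\w*", r"\w+", r"\w", r"\w?"]

def anchored_result(tail):
    """Try (?=.*\\d) followed by `tail` at position 0 only (hypothetical helper)."""
    m = re.compile(r"(?=.*\d)" + tail).match(SUBJECT)
    return None if m is None else m.group()

results = {tail: anchored_result(tail) for tail in TAILS}
for tail, res in results.items():
    # '' means an empty match at position 0; None means no match at all.
    print(tail, repr(res))
```

In this sketch \w* and \w? produce an empty match at the start, while \w+ and \w fail outright, which is the split between the two groups of grep results above.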

However, I stand by my claim that implicit anchoring (if that is what is happening) is a bug, since the manpage is very clear that this is an optimization, and an optimization should not change behavior. (It is also inconsistent with the successful match of (?=.*\d)a .) But it is possible that the bug is in the documentation, because, according to @perreal, Perl also rejects these matches, and PCRE is supposed to be Perl-compatible.

+1

The reason why:

 (?=.*\d)\w{4,} 

returns nothing is the first part:

 (?=.*\d) 

which matches the whole input and is a positive lookahead. A positive lookahead is an assertion whose matched text is not consumed or returned. For a better explanation, see perldoc perlre .
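That a lookahead succeeds without returning any text can be seen in a short Python sketch (Python's re is used here as a stand-in and is not the engine behind grep -P):

```python
import re

# The lookahead succeeds at position 0 but consumes no characters.
m = re.match(r"(?=.*\d)", "!abcae20")
matched_text, span = m.group(), m.span()

print(repr(matched_text))  # empty string
print(span)                # start and end are the same position
```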

0

Source: https://habr.com/ru/post/958530/

