Regular Expression Uncertainty

The main question about the mechanics of regular expressions:

I have the following expression: [10]*1[10]* .

Will it fit 100 ?

My reasoning is:
first option: [10]* matches "100", and then comes to the end of the line => no match.
second option: [10]* ignored and the expression matches.

Am I forgetting something trivial, or will it depend on the mechanism of regular expressions? (I remember something about greedy, not greedy, but I'm not sure if this applies to this case)

+4
source share
3 answers

Regex engines come back.

The engine tries to match 100 with [10]* , but this does not work, because then 1 has nothing to do. But then the engine throws out the last repetition character (only using [10]* for 10 ) and tries again. Still not working because 1 does not match 0 . The engine will throw one character at a time until the first [10*] is completely removed. Now 1 matches and [10]* happily matches the rest.

I recommend reading this tutorial because it explains very much what is happening under the hood. (For your specific problem, see the "Repetition" section).

Additional Information:

It does not depend on whether the repetition is greedy or uneven. The regex engine will always return. It will start from the other end (with 0 occurrences [10] ) if you make it jagged as follows: [10]*? . In this case, this will speed up the process, because the first attempt will already match, but this will not change the fact that it always matches.

In fact, you can manually prevent the engine from rolling back by making repetition "possessive." If you do this and the repetition remains first, the engine will not try to repeat the other possible repetitions. This will be the syntax: [10]*+ . Now the engine will correspond to 100 only with this first part. Then match 1 will fail, but since you made the repetition possessive, it will not return to try different use cases [10]* . In this case, of course, it is useless, but there are cases when this behavior is desirable. And all this is also described in a related tutorial .;)

+2
source

Answer: yes, this is appropriate, because the regular expression parser will consume as much from each subexpression as it needs to achieve compliance for the entire expression.

In your case, to match, it will do the following:

  • first [10]* will consume null characters
  • then it will match literal 1
  • then the last [10]* will use the remaining input


Finally, instead of asking here, why not try it on regexpal and see for yourself!

+2
source

This is easy enough to verify. Here is a little php script:

 <?php if (preg_match('/[10]*1[10]*/', '100')) { echo "It matches.\n"; } else { echo "It doesn't match.\n"; } ?> 

And the result:

 It matches. 

Explanation: after some trial and reverse tracking of the regular expression engine, the final result is that the first [10]* does not match. 1 corresponds to text 1 , and the second [10]* corresponds to text 00 .

+1
source

Source: https://habr.com/ru/post/1442157/


All Articles