Perl vs Javascript regular expressions

Why does the following regex capture (via the capture group) the string "abc" in Javascript, but not in PCRE (although it will still match)?

(.*)*

+6
source share
1 answer

Here's why the capture group is empty in PCRE:

  • Initial state

     (.*)* abc ^ ^ 
  • First, the (.*) Part is mapped to abc , and the input position advances to the end. The capture group contains abc at this point.

     (.*)* abc ^ ^ 
  • Now the input position is after the character c , the remaining one is an empty string. Star Kleene initiates a second group matching attempt (.*) :

     (.*)* abc ^ ^ 
  • The group (.*) Matches the empty line after abc . Since it matches, the previously recorded line has been overwritten .

  • Since the input position has not progressed, * completes the iteration, and the match succeeds.

The difference in behavior between JS and PCRE is due to how regular expression engines are defined. PCRE behavior is compatible with Perl:

PCRE:

 $ pcretest PCRE version 8.39 2016-06-14 re> /(.*)*/ data> abc 0: abc 1: 

Perl:

 $ perl -e '"abc" =~ /(.*)*/; print "<$&> <$1>\n";' <abc> <> 

Let it compare this with .NET , which has the same behavior but supports multiple captures:

.NET regex

When the capture group is mapped a second time, .NET will add the captured value to the capture stack. Perl and PCRE just overwrite it.


Like for JavaScript:

Here ECMA-262 Β§21.2.2.5.1 Runtime semantics: RepeatMatcher Abstract operation:

The RepeatMatcher abstract operation takes eight parameters: Matcher m , integer min , integer (or ∞) max , boolean greedy , state x , continuation c , integer parenIndex and integer parenCount , and performs the following steps:

  • If max is zero, return c(x) .
  • Create an inner closure for the continuation d that takes one state argument y and performs the following steps when evaluating:
    • a. If min is zero and y endIndex is x endIndex , return failure .
    • b. If min is zero, then min2 is zero; otherwise let min2 be min‑1 .
    • from. If max - ∞, let max2 - ∞; otherwise, let max2 be max‑1 .
    • e. Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result.
  • Let cap be a fresh copy of x displaying a list.
  • For every integer k satisfying parenIndex < k and k ≀ parenIndex+parenCount , set cap[k] to undefined .
  • Let e be x endIndex.
  • Let xr be the state (e, cap) .
  • If min not equal to zero, return m(xr, d) .
  • If greedy is false , then
    • a. Call c(x) and let z be its result.
    • b. If z not failure , return z .
    • from. Call m(xr, d) and return its result.
  • Call m(xr, d) and let z be its result.
  • If z not failure , return z .
  • Call c(x) and return its result.

This is basically a definition of what should happen when evaluating a quantifier. RepeatMatcher - operation processing the correspondence of the internal operation m .

You also need to understand what a state is (Β§21.2.2.1, my emphasis):

State is an ordered pair ( endIndex , captures ), where endIndex is an integer, and capture is a list of NcapturingParens values. States are used to represent partial matching states in regular expression matching algorithms. endIndex is the plus plus index of the last input character so far matched by the pattern , while captures contains the results of capturing parentheses. The n th element of captures is either a List that represents the value obtained by the n th set for sliding brackets or undefined if the n th set of brackets has not yet been reached. Due to indentation, many States may be used at any time in the negotiation process.

For your example, RepeatMatcher options:

  • m : Matches responsible for processing the subpattern (.*)
  • min : 0 (minimum number of matches for the Klein star quantifier)
  • max : ∞ (maximum number of matches for the Klein star quantifier)
  • greedy : true (the Klein star quantifier used is greedy)
  • x : (0, [undefined]) (see definition of state above)
  • c : continued, at this point it will be: continued, which always returns its state argument as a successful MatchResult after folding the parent rules
  • parenIndex : 0 (according to Β§ 21.2.2.5 this is the number of left brackets for the brackets in the entire regular expression, which is to the left of this production)
  • parenCount : 1 (the same spec paragraph, this is the number of left parentheses in the expansion of this Atom product - I will not insert the full specification here, but it basically means that m defines one capture group)

The regular expression matching algorithm in spec is defined in terms of a continuation style . This basically means that operation c means what needs to happen next.

Let me deploy this algorithm.

First iteration

On the first pass, the state x 1 is (0, [undefined]) .

  • max not zero
  • At this moment, we create a closure of the continuation d 1 , it will be used in the second pass, so we will return to this later.
  • Copy cap 1 capture list
  • Reset capture in cap 1 to undefined , this is non-op in the first pass
  • Let e 1 = 0
  • Let xr 1 = ( e 1 , cap 1 )
  • min is zero, skip this step
  • greedy true, skip this step
  • Let z 1 = m ( xr , d 1 )) - this evaluates the subpattern (.*)

Now let's go back a little - m will match (.*) Against abc , and then call our closure d 1 , so let me expand this one.

d 1 is evaluated with the state y 1 = (3, ["abc"]) :

  • min is 0, but y 1 endIndex is 3 and x 1 endIndex is 0, so do not return failure
  • Let min2 = min = 0, since min = 0
  • Let max2 = max = ∞, since max = ∞
  • Call RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount) and return its result. That is: RepeatMatcher (m, 0, ∞, false, y 1 , c, 0, 1)

Second iteration

So, right now we are looking for a second iteration with x 2 = y 1 = (3, ["abc"]) .

  • max not zero
  • At this moment, we create the closure of the continuation of d 2
  • Make a copy of cap 2 capture list ["abc"]
  • Reset capture in cap 2 to undefined , get cap 2 = [undefined]
  • Let e 2 = 3
  • Let xr 2 = ( e 2 , cap 2 )
  • min is zero, skip this step
  • greedy true, skip this step
  • Let z 2 = m ( xr 2 , d 2 )) - this evaluates the subpattern (.*)

    This time m will match the empty string after abc and cause our closure d 2 with this. Evaluate what d 2 does. Its parameter y 2 = (3, [""])

    min is still 0, but y 2 endIndex is 3 and x 2 endIndex also 3 (remember that this time x is the state y from the previous iteration), the closure simply returns failure .

  • z 2 is failure , skip this step
  • return c ( x 2 ), which is at this iteration c((3, ["abc"])) .

c just returns a valid MatchResult here when we are at the end of the pattern. This means that d 1 returns this result, and the first iteration returns it along with step 10.

Basically, as you can see, the specification line that causes the JS behavior is different from PCRE, the following:

a. If min is zero and y endIndex is x endIndex , return failure .

In conjunction with:

  1. Call c(x) and return its result.

Returns previously committed values ​​if the iteration fails.

+8
source

Source: https://habr.com/ru/post/1012450/


All Articles