Here's why the capture group is empty in PCRE:
Initial state
(.*)* abc ^ ^
First, the (.*)
Part is mapped to abc
, and the input position advances to the end. The capture group contains abc
at this point.
(.*)* abc ^ ^
Now the input position is after the character c
, the remaining one is an empty string. Star Kleene initiates a second group matching attempt (.*)
:
(.*)* abc ^ ^
The group (.*)
Matches the empty line after abc
. Since it matches, the previously recorded line has been overwritten .
Since the input position has not progressed, *
completes the iteration, and the match succeeds.
The difference in behavior between JS and PCRE is due to how regular expression engines are defined. PCRE behavior is compatible with Perl:
PCRE:
$ pcretest PCRE version 8.39 2016-06-14 re> /(.*)*/ data> abc 0: abc 1:
Perl:
$ perl -e '"abc" =~ /(.*)*/; print "<$&> <$1>\n";' <abc> <>
Let it compare this with .NET , which has the same behavior but supports multiple captures:

When the capture group is mapped a second time, .NET will add the captured value to the capture stack. Perl and PCRE just overwrite it.
Like for JavaScript:
Here ECMA-262 Β§21.2.2.5.1 Runtime semantics: RepeatMatcher Abstract operation:
The RepeatMatcher abstract operation takes eight parameters: Matcher m
, integer min
, integer (or β) max
, boolean greedy
, state x
, continuation c
, integer parenIndex
and integer parenCount
, and performs the following steps:
- If
max
is zero, return c(x)
. - Create an inner closure for the continuation
d
that takes one state argument y
and performs the following steps when evaluating:- a. If
min
is zero and y
endIndex
is x
endIndex
, return failure
. - b. If
min
is zero, then min2
is zero; otherwise let min2
be minβ1
. - from. If
max
- β, let max2
- β; otherwise, let max2
be maxβ1
. - e. Call
RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount)
and return its result.
- Let
cap
be a fresh copy of x
displaying a list. - For every integer
k
satisfying parenIndex < k
and k β€ parenIndex+parenCount
, set cap[k]
to undefined
. - Let
e
be x
endIndex. - Let
xr
be the state (e, cap)
. - If
min
not equal to zero, return m(xr, d)
. - If
greedy
is false
, then- a. Call
c(x)
and let z
be its result. - b. If
z
not failure
, return z
. - from. Call
m(xr, d)
and return its result.
- Call
m(xr, d)
and let z
be its result. - If
z
not failure
, return z
. - Call
c(x)
and return its result.
This is basically a definition of what should happen when evaluating a quantifier. RepeatMatcher
- operation processing the correspondence of the internal operation m
.
You also need to understand what a state is (Β§21.2.2.1, my emphasis):
State is an ordered pair ( endIndex
, captures
), where endIndex
is an integer, and capture is a list of NcapturingParens
values. States are used to represent partial matching states in regular expression matching algorithms. endIndex
is the plus plus index of the last input character so far matched by the pattern , while captures
contains the results of capturing parentheses. The n
th element of captures
is either a List that represents the value obtained by the n
th set for sliding brackets or undefined if the n
th set of brackets has not yet been reached. Due to indentation, many States may be used at any time in the negotiation process.
For your example, RepeatMatcher
options:
m
: Matches responsible for processing the subpattern (.*)
min
: 0 (minimum number of matches for the Klein star quantifier)max
: β (maximum number of matches for the Klein star quantifier)greedy
: true (the Klein star quantifier used is greedy)x
: (0, [undefined])
(see definition of state above)c
: continued, at this point it will be: continued, which always returns its state argument as a successful MatchResult
after folding the parent rulesparenIndex
: 0 (according to Β§ 21.2.2.5 this is the number of left brackets for the brackets in the entire regular expression, which is to the left of this production)parenCount
: 1 (the same spec paragraph, this is the number of left parentheses in the expansion of this Atom product - I will not insert the full specification here, but it basically means that m
defines one capture group)
The regular expression matching algorithm in spec is defined in terms of a continuation style . This basically means that operation c
means what needs to happen next.
Let me deploy this algorithm.
First iteration
On the first pass, the state x
1 is (0, [undefined])
.
max
not zero- At this moment, we create a closure of the continuation
d
1 , it will be used in the second pass, so we will return to this later. - Copy
cap
1 capture list - Reset capture in
cap
1 to undefined
, this is non-op in the first pass - Let
e
1 = 0 - Let
xr
1 = ( e
1 , cap
1 ) min
is zero, skip this stepgreedy
true, skip this step- Let
z
1 = m
( xr
, d
1 )) - this evaluates the subpattern (.*)
Now let's go back a little - m
will match (.*)
Against abc
, and then call our closure d
1 , so let me expand this one.
d
1 is evaluated with the state y
1 = (3, ["abc"])
:
min
is 0, but y
1 endIndex
is 3 and x
1 endIndex
is 0, so do not return failure
- Let
min2
= min
= 0, since min
= 0 - Let
max2
= max
= β, since max
= β - Call
RepeatMatcher(m, min2, max2, greedy, y, c, parenIndex, parenCount)
and return its result. That is: RepeatMatcher (m, 0, β, false, y 1 , c, 0, 1)
Second iteration
So, right now we are looking for a second iteration with x
2 = y
1 = (3, ["abc"])
.
max
not zero- At this moment, we create the closure of the continuation of
d
2 - Make a copy of
cap
2 capture list ["abc"]
- Reset capture in
cap
2 to undefined
, get cap
2 = [undefined]
- Let
e
2 = 3 - Let
xr
2 = ( e
2 , cap
2 ) min
is zero, skip this stepgreedy
true, skip this stepLet z
2 = m
( xr
2 , d
2 )) - this evaluates the subpattern (.*)
This time m
will match the empty string after abc
and cause our closure d
2 with this. Evaluate what d
2 does. Its parameter y
2 = (3, [""])
min
is still 0, but y
2 endIndex
is 3 and x
2 endIndex
also 3 (remember that this time x
is the state y
from the previous iteration), the closure simply returns failure
.
z
2 is failure
, skip this step- return
c
( x
2 ), which is at this iteration c((3, ["abc"]))
.
c
just returns a valid MatchResult here when we are at the end of the pattern. This means that d
1 returns this result, and the first iteration returns it along with step 10.
Basically, as you can see, the specification line that causes the JS behavior is different from PCRE, the following:
a. If min
is zero and y
endIndex
is x
endIndex
, return failure
.
In conjunction with:
- Call
c(x)
and return its result.
Returns previously committed values ββif the iteration fails.