R vs sed regex greed

Question


I don't quite understand why this doesn't return "test", and an explanation would be useful:

 a = "blah test"
 sub('^.*(test|$)', '\\1', a)
 # [1] ""

Compare it with the equivalent sed expression:

 echo 'blah test' | sed -r 's/^.*(test|$)/\1/'
 # test
 echo 'blah blah' | sed -r 's/^.*(test|$)/\1/'
 #

FWIW, the following achieves what I want in R (and is equivalent to the sed results above):

 sub('^.*(test)|^.*', '\\1', a)

+4

regex r sed

eddi Jul 18 '13 at 15:40


2 answers

You need to make ^.* non-greedy:

 > sub('^.*?(test|$)', '\\1', "blah test")
 [1] "test"
 > sub('^.*?(test|$)', '\\1', "blah blah")
 [1] ""

+5

GSee Jul 18 '13 at 16:10


Akash · Accepted Answer · 2013-07-18T16:36:25+0000

The regex engine starts by letting the greedy .* match every character up to the end of the line, then tries to match (test|$), that is, either the literal string 'test' or the end of the string. Because the greedy .* has already consumed all the characters, 'test' cannot match at that position, but $ can: it matches the end of the line, so the engine succeeds without ever backtracking far enough to find 'test'.

The group therefore captures the empty match at the end of the line, which is why your result is "".

I think sed uses a POSIX NFA, which tries to find the longest match in an alternation; R, by contrast, seems to use a traditional NFA.
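To make the sed side of this concrete, here is a minimal shell sketch that prints both the whole match and the captured group for the two inputs from the question. It assumes GNU sed (the group contents are the ones reported in the question, and other sed implementations may split the match differently), and show_match is just a hypothetical helper name. The point is that the overall match is the entire line either way; only what ends up in the group differs between engines.

```shell
#!/bin/sh
# Sketch only: assumes GNU sed. show_match is a hypothetical helper name.
# In a sed replacement, & is the whole match and \1 is the first group.
show_match() {
    printf '%s\n' "$1" | sed -E 's/^.*(test|$)/whole=[&] group=[\1]/'
}

# Input that contains "test": both alternatives could end the match.
with_test=$(show_match 'blah test')

# Input without "test": only the $ alternative can match, so the group is empty.
without_test=$(show_match 'blah blah')

printf '%s\n' "$with_test"
printf '%s\n' "$without_test"
```

Going by the outputs in the question, GNU sed should report the group as test for the first input and as empty for the second, while the whole match covers the full line both times. In R, the non-greedy ^.*? from the other answer is what gets the same group contents.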
