R vs sed regex greed

I do not quite understand why the following does not return "test"; an explanation would be helpful:

 a = "blah test"
 sub('^.*(test|$)', '\\1', a)
 # [1] ""

Compare it with the equivalent sed expression:

 echo 'blah test' | sed -r 's/^.*(test|$)/\1/'
 # test
 echo 'blah blah' | sed -r 's/^.*(test|$)/\1/'
 #

FWIW, what I want in R can be achieved with the following (it is equivalent to the sed results above):

 sub('^.*(test)|^.*', '\\1', a) 
2 answers

The regex engine starts by letting the greedy .* match every character up to the end of the string, then tries to match (test|$), that is, either the literal 'test' or the end of the string. Because .* has already consumed all the characters, 'test' cannot match at that position, but $ can: it matches the end of the string right away, so the overall match succeeds without .* having to give back any characters.

That is why the captured group, and therefore your result, is the empty string at the end of the line.
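
To see this for yourself, here is a minimal sketch using base R's regexec() and regmatches() to pull out both the whole match and the captured group. The variable names (m, whole, group, result) are just illustrative, and it assumes the default regex engine (no perl = TRUE), as in the question.

```r
a <- "blah test"

# regexec() reports the whole match followed by each capture group.
m <- regmatches(a, regexec("^.*(test|$)", a))[[1]]

whole <- m[1]  # text consumed by the entire pattern
group <- m[2]  # text captured by (test|$)

print(whole)
print(group)

# sub() replaces the whole match with the group,
# so its result should be the same as `group`.
result <- sub("^.*(test|$)", "\\1", a)
print(identical(result, group))
```

If the explanation above is right, the whole match should be the full string while the group is empty, which is exactly the "" reported in the question.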

I think sed uses a POSIX NFA, which tries to find the longest match within the alternation, whereas R seems to use a traditional NFA, which is why the two give different results.


You need to make the .* non-greedy by writing .*? instead:

 > sub('^.*?(test|$)', '\\1', "blah test")
 [1] "test"
 > sub('^.*?(test|$)', '\\1', "blah blah")
 [1] ""
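
As a rough side-by-side, the sketch below runs the greedy and the non-greedy pattern over both example strings at once (sub() is vectorized over its input). The names inputs, greedy and lazy are made up for illustration, and the default regex engine is assumed.

```r
inputs <- c("blah test", "blah blah")

# Greedy: .* takes everything first, so only $ is left for the group to match.
greedy <- sub("^.*(test|$)", "\\1", inputs)

# Non-greedy: .*? takes as little as possible, so the group gets a
# chance to match "test" before the end of the string is reached.
lazy <- sub("^.*?(test|$)", "\\1", inputs)

print(data.frame(input = inputs, greedy = greedy, lazy = lazy))
```

The lazy column should line up with the sed output from the question, while the greedy column stays empty for both inputs.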

Source: https://habr.com/ru/post/1492123/

