A regular expression that includes and excludes certain strings in R

Question

I am trying to use R to parse multiple records. I have two requirements for the records I want to return. I want all entries to contain the word apple, but not the word orange.

For instance:

1. I like apples.
2. I really like apples
3. I like apples and oranges.

I want to return records 1 and 2.

How can I use R for this?

Thanks.

+6

regex r

janovak May 29 '14 at 21:55

3 answers

You could just do:

    temp <- c("I like apples", "I really like apples", "I like apples and oranges")
    temp[grepl("apple", temp) & !grepl("orange", temp)]
    ## [1] "I like apples" "I really like apples"
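Because both search terms here are plain substrings rather than patterns, the same filter can be sketched with literal matching. This is only an illustration: `filter_records` is a hypothetical helper name, not something from the answer, and `fixed = TRUE` is an optional `grepl` argument that treats the pattern as a literal string.

```r
# Hypothetical helper (not from the original answer): keep the strings that
# contain `include` and do not contain `exclude`, matching both literally.
filter_records <- function(x, include, exclude) {
  has_include <- grepl(include, x, fixed = TRUE)  # fixed = TRUE: no regex interpretation
  has_exclude <- grepl(exclude, x, fixed = TRUE)
  x[has_include & !has_exclude]
}

temp <- c("I like apples", "I really like apples", "I like apples and oranges")
result <- filter_records(temp, "apple", "orange")
print(result)
```

Splitting the test into two logical vectors keeps each condition readable and makes it easy to add further include or exclude terms later.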

+18

David Arenburg May 29 '14 at 21:59

This regex is a bit shorter and much faster than the other regex versions (see the comparison below). I don't have the tools to compare it with David's double grepl, so if someone can benchmark the single grep below against the double grepl, we will find out. The comparison should cover both the matching and the non-matching case.

 ^(?!.*orange).*apple.*$

The negative lookahead ensures that the string does not contain orange.
We then simply match the string as long as it contains apple. No lookahead is needed there.

Code example

    grep("^(?!.*orange).*apple.*$", subject, perl = TRUE, value = TRUE)
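To see how the two halves of the pattern divide the work, the sketch below tests each part separately and then the combined pattern. The `subject` vector is made up for illustration (the "I like pears" record is an assumption added to show a string with neither word).

```r
# Illustrative input; the last record is an assumption, not from the question
subject <- c("I like apples", "I really like apples",
             "I like apples and oranges", "I like pears")

# Part 1: the negative lookahead alone rejects any string containing "orange"
no_orange <- grepl("^(?!.*orange)", subject, perl = TRUE)

# Part 2: the remainder of the pattern requires "apple" somewhere in the string
has_apple <- grepl("^.*apple.*$", subject, perl = TRUE)

# The combined pattern from the answer
combined <- grepl("^(?!.*orange).*apple.*$", subject, perl = TRUE)

print(data.frame(subject, no_orange, has_apple, combined))
```

On this input the combined pattern agrees with `no_orange & has_apple`, which is what the lookahead-then-match structure is meant to express.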

Speed comparison

@hwnd has now removed the double-lookahead version, but according to RegexBuddy the speed difference remains:

1. Against I like apples and oranges , the engine takes 22 steps to fail, versus 143 for the double-lookahead version ^(?=.*apple)((?!orange).)*$ and 22 steps for ^((?!.*orange).)*apple.*$ (a tie there, but wait for point 2).
2. Against I really like apples , the engine takes 64 steps to succeed, versus 104 for the double-lookahead version ^(?=.*apple)((?!orange).)*$ and 538 steps for ^((?!.*orange).)*apple.*$ .

These numbers were provided by the RegexBuddy debugger.
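The step counts above come from a regex debugger, not from R itself. A rough way to compare the single-regex and double-grepl approaches in R is sketched below using base `system.time`. The input size, repetition count, and the function names `single_regex` and `double_grepl` are arbitrary choices for illustration, and timings will vary by machine, so no results are claimed here.

```r
# Build a larger test vector by resampling the example sentences (assumed workload)
set.seed(1)
records <- sample(c("I like apples", "I really like apples",
                    "I like apples and oranges", "I like oranges and apples"),
                  10000, replace = TRUE)

# The two approaches under comparison (function names are illustrative)
single_regex <- function(x) grepl("^(?!.*orange).*apple.*$", x, perl = TRUE)
double_grepl <- function(x) grepl("apple", x) & !grepl("orange", x)

# Check that both approaches select the same records before timing them
same_result <- identical(single_regex(records), double_grepl(records))

# Time 20 repetitions of each; elapsed seconds depend on the machine
t_single <- system.time(for (i in 1:20) single_regex(records))
t_double <- system.time(for (i in 1:20) double_grepl(records))
timings <- c(single = t_single[["elapsed"]], double = t_double[["elapsed"]])
print(timings)
```

Because the sample mixes matching and non-matching records, it exercises both the success and the failure path in a single run; separating them into two vectors would give a finer comparison.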

+8

zx81 May 29 '14 at 22:59

hwnd · Accepted Answer · 2014-05-29T22:03:16+0000

Using a regular expression, you can do the following.

    x <- c('I like apples', 'I really like apples', 'I like apples and oranges',
           'I like oranges and apples', 'I really like oranges and apples but oranges more')
    x[grepl('^((?!.*orange).)*apple.*$', x, perl=TRUE)]
    # [1] "I like apples" "I really like apples"

The regular expression uses a negative lookahead to assert that what follows is not any run of characters (except line breaks) followed by the substring orange. If the assertion holds, the dot . matches any character except a line break; because it is wrapped in a group, this is repeated (0 or more times). Next we look for apple followed by any character except a line break (0 or more times). Finally, the start-of-line and end-of-line anchors ensure that the whole input is consumed.

UPDATE: You can use the following if performance is an issue.

 x[grepl('^(?!.*orange).*$', x, perl=TRUE)]
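Note that the pattern in this update, as written, only rejects strings containing orange; it does not itself require apple. On the sample vector that makes no difference, since every record mentions apples. The sketch below uses a made-up vector (the "I like pears" record is an assumption) to show the difference, with a variant that also demands apple.

```r
# Illustrative input; "I like pears" is an assumption added for contrast
x <- c("I like apples", "I like apples and oranges", "I like pears")

# Exclusion only: any string without "orange" passes, including "I like pears"
exclude_only <- x[grepl("^(?!.*orange).*$", x, perl = TRUE)]

# Exclusion plus inclusion: the string must also contain "apple"
exclude_and_include <- x[grepl("^(?!.*orange).*apple", x, perl = TRUE)]

print(exclude_only)
print(exclude_and_include)
```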
