A regular expression that includes and excludes certain strings in R

I am trying to use R to parse multiple records. I have two requirements for the records I want to return. I want all entries to contain the word apple, but not the word orange.

For instance:

  • I like apples.
  • I really like apples
  • I like apples and oranges.

I want to return records 1 and 2.

How can I use R for this?

Thanks.

3 answers

Using a regular expression, you can do the following.

 x <- c('I like apples', 'I really like apples', 'I like apples and oranges',
        'I like oranges and apples', 'I really like oranges and apples but oranges more')
 x[grepl('^((?!.*orange).)*apple.*$', x, perl=TRUE)]
 # [1] "I like apples" "I really like apples"

The negative lookahead (?!.*orange) checks that the substring orange does not occur anywhere after the current position; if it does not, the dot . matches any single character except a line break. Because the lookahead and the dot are wrapped in a group that is repeated (0 or more times), the check is made before each character the group consumes. Then we look for apple followed by any character except a line break (0 or more times). Finally, the start-of-line and end-of-line anchors ensure that the entire input is consumed.
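To make the breakdown above concrete, here is a small sketch that runs the same pattern against a few strings and compares it with two plain substring checks. The test strings and the variable names (matches, has_apple, has_orange, edge) are made up for illustration. It also probes one case the explanation implies: the lookahead only runs when the group consumes a character, so a string that starts with apple never triggers it.

```r
x <- c("I like apples",
       "I like apples and oranges",
       "I like oranges and apples",
       "I like pears")

# Pattern from the answer: the lookahead is re-checked before every
# character consumed by the repeated group.
pattern <- "^((?!.*orange).)*apple.*$"
matches <- grepl(pattern, x, perl = TRUE)

# The same requirement written as two separate literal checks.
has_apple  <- grepl("apple", x, fixed = TRUE)
has_orange <- grepl("orange", x, fixed = TRUE)

print(data.frame(x, matches, expected = has_apple & !has_orange))

# Edge case: "apple" sits at position 0, so the group repeats zero times
# and the lookahead is never evaluated; the string matches despite "orange".
edge <- grepl(pattern, "apples and oranges", perl = TRUE)
print(edge)  # TRUE
```

For the four strings in x the pattern agrees with the two-condition check, but the edge case suggests it is worth testing against inputs that begin with apple before relying on it.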


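UPDATE: If performance is a concern, you can use the following instead. It checks for orange only once, from the start of the string, and still requires apple.

 x[grepl('^(?!.*orange).*apple.*$', x, perl=TRUE)]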

Wouldn't simply combining two grepl calls do?

 temp <- c("I like apples", "I really like apples", "I like apples and oranges")
 temp[grepl("apple", temp) & !grepl("orange", temp)]
 ## [1] "I like apples" "I really like apples"
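If this check is needed in more than one place, the two grepl calls can be wrapped in a small function. The sketch below is one way to do that; the helper name keep_with_without and its arguments are hypothetical, and it assumes both terms should be treated as literal text (hence fixed = TRUE) rather than as regular expressions.

```r
# Hypothetical helper (not part of the answer): keep the strings that
# contain `include` and do not contain `exclude`, both taken literally.
keep_with_without <- function(x, include, exclude) {
  x[grepl(include, x, fixed = TRUE) & !grepl(exclude, x, fixed = TRUE)]
}

temp <- c("I like apples", "I really like apples", "I like apples and oranges")
kept <- keep_with_without(temp, "apple", "orange")
print(kept)
## [1] "I like apples"        "I really like apples"
```

Dropping fixed = TRUE would let the arguments be patterns again, at the cost of having to escape any regex metacharacters in the search terms.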

This regular expression is slightly shorter and much faster than the other regex versions (see the comparison below). I don't have the tools to compare it with David's double grepl, so if someone can benchmark the single grep below against the double grepl, we can find out. The comparison should cover both the matching case and the failing case.

 ^(?!.*orange).*apple.*$ 
  • A negative lookahead ensures that the string does not contain orange.
  • We then simply match the string if it contains apple. No lookahead is needed for that.

Code example

 grep("^(?!.*orange).*apple.*$", subject, perl=TRUE, value=TRUE)

Speed comparison

@hwnd has since removed the double-lookahead version, but according to RegexBuddy the speed difference remains:

  • Against I like apples and oranges, the engine takes 22 steps to fail, versus 143 for the double-lookahead version ^(?=.*apple)((?!orange).)*$ and 22 steps for ^((?!.*orange).)*apple.*$ (a tie there, but see the second point).
  • Against I really like apples, the engine takes 64 steps to succeed, versus 104 for the double-lookahead version ^(?=.*apple)((?!orange).)*$ and 538 steps for ^((?!.*orange).)*apple.*$.

These numbers were provided by the RegexBuddy debugger.
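RegexBuddy step counts are not timings in R, so the comparison against the double grepl requested above still has to be measured. A rough sketch using only base R's system.time might look like the following; the test vector, its size, the repetition count and the function names single_grep and double_grepl are arbitrary choices for illustration, and the results will vary by machine and R version.

```r
# Mix of matching and failing strings, repeated to get measurable timings.
subject <- rep(c("I like apples",
                 "I really like apples",
                 "I like apples and oranges",
                 "I like oranges and apples"),
               times = 10000)

single_grep <- function(s) {
  grep("^(?!.*orange).*apple.*$", s, perl = TRUE, value = TRUE)
}
double_grepl <- function(s) {
  s[grepl("apple", s) & !grepl("orange", s)]
}

# Both approaches should select the same strings for this input.
same_result <- identical(single_grep(subject), double_grepl(subject))
stopifnot(same_result)

# Time several repetitions of each approach.
t_single <- system.time(for (i in 1:5) single_grep(subject))
t_double <- system.time(for (i in 1:5) double_grepl(subject))

print(c(single_grep  = t_single[["elapsed"]],
        double_grepl = t_double[["elapsed"]]))
```

To look at the success and failure cases separately, the same timing can be repeated with subject restricted to only matching or only non-matching strings.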


Source: https://habr.com/ru/post/970059/

