Match / find rows based on multiple required values in a single row in R

This must be a duplicate, but I cannot find it, so here goes.

I have a data.frame with two columns. One contains a group and the other contains a criterion. A group can have many different criteria, but only one per row. I want to identify the groups that contain three specific criteria (these will appear on different rows). In my case, I want to identify all groups that contain the criteria "I", "E", and "C". Groups can contain any number and combination of these and several other letters.

    test <- data.frame(grp=c(1,1,2,2,2,3,3,3,4,4,4,4,4),
                       val=c("C","I","E","I","C","E","I","A","C","I","E","E","A"))

    > test
       grp val
    1    1   C
    2    1   I
    3    2   E
    4    2   I
    5    2   C
    6    3   E
    7    3   I
    8    3   A
    9    4   C
    10   4   I
    11   4   E
    12   4   E
    13   4   A

In the above example, I want to identify grp 2 and 4, because each of them contains the letters E, I, and C.

Thanks!

2 answers

Here is a dplyr solution. %in% is vectorized, so c("E", "I", "C") %in% val returns a logical vector of length three. For the target groups, passing this vector to all() returns TRUE. This is our filter, and we run it within each group using group_by().

    library(dplyr)

    test %>%
      group_by(grp) %>%
      filter(all(c("E", "I", "C") %in% val))
    # Source: local data frame [8 x 2]
    # Groups: grp [2]
    #
    #     grp    val
    #   (dbl) (fctr)
    # 1     2      E
    # 2     2      I
    # 3     2      C
    # 4     4      C
    # 5     4      I
    # 6     4      E
    # 7     4      E
    # 8     4      A
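To see why the filter works, the same check can be run by hand on individual groups. This is only a small base R sketch; the names required, val_g1, val_g2, in_g1 and in_g2 are made up for the illustration and are not part of the answer above.

```r
# Same data as in the question.
test <- data.frame(
  grp = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4),
  val = c("C", "I", "E", "I", "C", "E", "I", "A", "C", "I", "E", "E", "A")
)

required <- c("E", "I", "C")  # hypothetical name for the required values

# Values that occur in group 1 and in group 2.
val_g1 <- test$val[test$grp == 1]
val_g2 <- test$val[test$grp == 2]

# One logical per required value: does it occur in the group?
in_g1 <- required %in% val_g1  # "E" does not occur in group 1
in_g2 <- required %in% val_g2  # all three occur in group 2

print(in_g1)
print(all(in_g1))  # FALSE, so group 1 would be dropped by the filter
print(in_g2)
print(all(in_g2))  # TRUE, so group 2 would be kept
```

group_by() followed by filter() performs this check once per group and keeps or drops all rows of the group together.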

Or, if this result is more convenient (thanks @Frank),

    test %>%
      group_by(grp) %>%
      summarise(matching = all(c("E", "I", "C") %in% val))
    # Source: local data frame [4 x 2]
    #
    #     grp matching
    #   (dbl)    (lgl)
    # 1     1    FALSE
    # 2     2     TRUE
    # 3     3    FALSE
    # 4     4     TRUE
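If only the matching group numbers are needed, the same per-group test can also be written without any packages. The following is a base R sketch of that idea using tapply(), not the approach from the answer itself; required, has_all and matching_grps are names invented here.

```r
# Same data as in the question.
test <- data.frame(
  grp = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4),
  val = c("C", "I", "E", "I", "C", "E", "I", "A", "C", "I", "E", "E", "A")
)

required <- c("E", "I", "C")  # hypothetical name for the required values

# For each group, check whether every required value occurs among its val entries.
has_all <- tapply(test$val, test$grp, function(v) all(required %in% v))

# The result is named by group, so keep the names where the check is TRUE.
matching_grps <- as.numeric(names(has_all)[has_all])
print(matching_grps)

# Rows of the original data that belong to a matching group.
matching_rows <- test[test$grp %in% matching_grps, ]
print(matching_rows)
```

Converting the names back with as.numeric() assumes that grp is numeric, as it is in the question's data.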

Another option, using data.table:

    library(data.table)

    test <- data.frame(grp=c(1,1,2,2,2,3,3,3,4,4,4,4,4),
                       val=c("C","I","E","I","C","E","I","A","C","I","E","E","A"))
    setDT(test)  # convert the data.frame into a data.table (by reference)

    # count how often each value occurs per group:
    # one row per grp, one column per val, the count in each cell
    group.counts <- dcast(test, grp ~ val, fun.aggregate = length, value.var = "val")

    group.counts[I>0 & E>0 & C>0,]  # now filtering is easy

Results in:

       grp A C E I
    1:   2 0 1 1 1
    2:   4 1 1 2 1

Instead of returning only the group numbers, you can also "join" the matching group numbers back to the original data to show the raw rows of each matching group:

    test[group.counts[I>0 & E>0 & C>0,], .SD, on="grp"]

It shows:

       grp val
    1:   2   E
    2:   2   I
    3:   2   C
    4:   4   C
    5:   4   I
    6:   4   E
    7:   4   E
    8:   4   A

PS: to make the solution easier to understand, here are the counts for all groups:

    > group.counts
       grp A C E I
    1:   1 0 1 0 1
    2:   2 0 1 1 1
    3:   3 1 0 1 1
    4:   4 1 1 2 1
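The count table above can also be reproduced in base R with table(), which may help when checking the logic without data.table installed. This is a sketch of the same counting idea, not the dcast() call itself; counts, required, keep and matching_grps are names chosen for the example.

```r
# Same data as in the question.
test <- data.frame(
  grp = c(1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 4, 4),
  val = c("C", "I", "E", "I", "C", "E", "I", "A", "C", "I", "E", "E", "A")
)

# Cross-tabulate group against value: one row per grp, one column per val.
counts <- table(test$grp, test$val)
print(counts)

required <- c("I", "E", "C")  # hypothetical name for the required values

# A group matches when every required column has a count above zero.
keep <- rowSums(counts[, required] > 0) == length(required)

# Row names of a table are character strings, so the group ids come back as text.
matching_grps <- rownames(counts)[keep]
print(matching_grps)
```

Comparing the number of non-zero required columns with length(required) plays the same role as the I>0 & E>0 & C>0 condition, and it does not need to be rewritten if the set of required values changes.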

Source: https://habr.com/ru/post/1244838/

