Use perl = TRUE regex in dplyr select

How can I select columns with a regex that needs perl = TRUE?

 data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% dplyr::select(matches("(?i)b(?!a)")) 

Error in grep(needle, haystack, ...): invalid regular expression '(?i)b(?!a)', reason "Invalid regular expression"

The regex itself is valid when perl = TRUE is used:

    grep("(?i)b(?!a)", c("baa","boo","boa","lol","bAa"), perl=T)
    # [1] 2 3

Is there a quick function or workaround for this?

3 answers

matches() in dplyr does not support perl = TRUE. However, you can write your own function. After a little digging in the source code, this works:

Quick way:

    library(dplyr)

    # Note the three colons: grep_vars is not exported from dplyr
    matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) {
      dplyr:::grep_vars(match, vars, ignore.case = ignore.case, perl = TRUE)
    }

    data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% select(matches2("(?i)b(?!a)"))
    #   boo boa
    # 1   0   0

Or a more explanatory solution:

    matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) {
      grep_vars2(match, vars, ignore.case = ignore.case)
    }

    # This is pretty much my only change to the original dplyr:::grep_vars,
    # to make it accept perl.
    grep_vars2 <- function (needle, haystack, ...) {
      grep(needle, haystack, perl = TRUE, ...)
    }

    data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% select(matches2("(?i)b(?!a)"))
    #   boo boa
    # 1   0   0
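The error in the question comes from R's default regex engine (TRE), which has no lookahead syntax; perl = TRUE switches grep() to PCRE, which does. A minimal base R sketch of that difference follows; the variable names are illustrative only and are not part of dplyr:

```r
pattern <- "(?i)b(?!a)"
cols <- c("baa", "boo", "boa", "lol", "bAa")

# Default engine: the lookahead (?!a) cannot be compiled, so grep() errors.
# The error is caught here and its message is kept as a string.
default_result <- tryCatch(
  suppressWarnings(grep(pattern, cols)),
  error = function(e) conditionMessage(e)
)

# PCRE engine: the same pattern compiles and matches "b" not followed by "a"/"A".
perl_result <- grep(pattern, cols, perl = TRUE, value = TRUE)

print(default_result)
print(perl_result)
```

Any wrapper that forwards perl = TRUE to grep(), as matches2 does above, therefore avoids the error.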

Another approach, a one-liner, though probably more fragile than LyzandeR's proposal:

    body(matches)[[grep("grep_vars", body(matches))]] <- substitute(grep_vars(match, vars, ignore.case = ignore.case, perl=T))

    data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0) %>% dplyr::select(matches("(?i)b(?!a)"))
    #   boo boa
    # 1   0   0

I would not hard-code body(matches)[[3]], since any update that changes the function body could break this small patch.
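Patching the function body may no longer be necessary. As far as I know, newer releases of tidyselect (the package that now provides dplyr's selection helpers) give matches() its own perl argument. A minimal sketch, assuming a reasonably recent dplyr/tidyselect installation; check ?tidyselect::matches for the version you have:

```r
library(dplyr)

df <- data.frame(baa = 0, boo = 0, boa = 0, lol = 0, bAa = 0)

# Assumption: the installed matches() accepts perl = TRUE.
# Older versions, like the one discussed in this thread, do not.
selected <- df %>% select(matches("(?i)b(?!a)", perl = TRUE))

print(names(selected))
```

If the argument is not available in your version, the wrapper approaches in this thread still apply.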


As an addendum to LyzandeR's answer, here is a version that does not use the dplyr vocabulary, only the magrittr pipe. This way, writing wrapper functions, specifying arguments, and so on can be skipped.

This is a bit more verbose than dplyr, but less verbose than base R, and it lets you use the full flexibility of any function, such as grep or stringi::stri_detect.

It is also much faster on this example; see the benchmarks below. Note that the speed should also be checked on larger examples: the fixed dplyr overhead dominates for a data frame this small, so a fair comparison depends on the use case.

    df <- data.frame(baa=0,boo=0,boa=0,lol=0,bAa=0)

    library(magrittr)
    df %>% .[,grep("(?i)b(?!a)", names(.), perl = T)]
    #   boo boa
    # 1   0   0

    # In the following, a copy of LyzandeR's approaches
    library(dplyr)
    matches2 <- function (match, ignore.case = TRUE, vars = current_vars()) {
      dplyr:::grep_vars(match, vars, ignore.case = ignore.case, perl = TRUE)
    }

    grep_vars2 <- function (needle, haystack, ...) {
      grep(needle, haystack, perl = TRUE, ...)
    }

    matches3 <- function (match, ignore.case = TRUE, vars = current_vars()) {
      grep_vars2(match, vars, ignore.case = ignore.case)
    }

    library(microbenchmark)
    microbenchmark(
      df %>% select(matches2("(?i)b(?!a)")),
      df %>% select(matches3("(?i)b(?!a)")),
      df %>% .[,grep("(?i)b(?!a)", names(.), perl = T)]
    )
    # Unit: microseconds
    #                                                expr      min       lq      mean    median        uq       max neval
    #               df %>% select(matches2("(?i)b(?!a)")) 3994.867 4309.877 4570.6414 4555.8065 4726.9310  6618.769   100
    #               df %>% select(matches3("(?i)b(?!a)")) 3981.841 4177.834 4792.2025 4396.3275 4655.6780 31812.876   100
    #  df %>% .[, grep("(?i)b(?!a)", names(.), perl = T)]  183.164  210.797  242.1678  237.2455  263.6935   554.624   100
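One caveat with plain bracket indexing: when the pattern matches exactly one column, `[` drops the result to a bare vector unless drop = FALSE is given. A small base R sketch of a guard against that; select_perl is a hypothetical helper name used only for illustration, and grepl is swapped in for grep to get a logical index:

```r
df <- data.frame(baa = 0, boo = 0, boa = 0, lol = 0, bAa = 0)

# Hypothetical helper: keep the columns whose names match a PCRE pattern.
select_perl <- function(data, pattern) {
  keep <- grepl(pattern, names(data), perl = TRUE)
  # drop = FALSE keeps a data frame even when only one column matches
  data[, keep, drop = FALSE]
}

two_cols <- select_perl(df, "(?i)b(?!a)")
one_col <- select_perl(df, "^lol$")

print(names(two_cols))
print(class(one_col))
```

The same drop = FALSE argument can be added to the piped `.[, ...]` call above.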

Source: https://habr.com/ru/post/1274196/

