How to select specific columns containing specific strings / characters?

I have this data frame:

    df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                      b = c(1, 2, 3, 4),
                      c = c("wrong", "wrong", "wrong", "wrong"),
                      d = c(2, 2, 3, 4))

            a b     c d
    1 correct 1 wrong 2
    2   wrong 2 wrong 2
    3   wrong 3 wrong 3
    4 correct 4 wrong 4

and would like to select only the columns that contain the values “correct” or “wrong” (that is, columns a and c in df1), so that I get this data frame:

    df2 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                      c = c("wrong", "wrong", "wrong", "wrong"))

            a     c
    1 correct wrong
    2   wrong wrong
    3   wrong wrong
    4 correct wrong

Can dplyr be used for this? If not, what functions can I use? The example I gave is simple, since I can just do this (dplyr):

 select(df1, a, c) 

However, in my actual data frame I have about 700 variables / columns, several hundred of which contain the values “correct” or “wrong”, and I don't know the names of those variables / columns.

Any suggestions on how to do this quickly? Thank you very much!

2 answers

You can use base R's Filter, which applies a function to each column of df1 and keeps every column for which the logical test is TRUE:

    Filter(function(u) any(c('wrong','correct') %in% u), df1)
    #        a     c
    #1 correct wrong
    #2   wrong wrong
    #3   wrong wrong
    #4 correct wrong
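To make explicit what Filter does with a data frame, here is a minimal sketch of the same selection written as two steps: compute one logical flag per column, then index the columns with it. The helper name `has_label` is hypothetical and chosen only for illustration.

```r
df1 <- data.frame(
  a = c("correct", "wrong", "wrong", "correct"),
  b = c(1, 2, 3, 4),
  c = c("wrong", "wrong", "wrong", "wrong"),
  d = c(2, 2, 3, 4)
)

# Hypothetical helper: TRUE if the column holds "correct" or "wrong".
has_label <- function(col) any(c("wrong", "correct") %in% col)

# One logical flag per column (a data frame is a list of columns).
keep <- vapply(df1, has_label, logical(1))

# Single-bracket indexing with a logical vector selects columns
# and always returns a data frame.
df2 <- df1[keep]

names(df2)
```

The intermediate `keep` vector is handy when you also want the names of the matching columns, for example `names(df1)[keep]`.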

You can also use grepl :

 Filter(function(u) any(grepl('wrong|correct',u)), df1) 
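Note that the two tests are not equivalent: `%in%` compares whole values, while `grepl` with this pattern matches the text anywhere inside a value. A small sketch of the difference, using made-up values that are not in the question's data:

```r
# Illustrative values only: neither is exactly "wrong" or "correct".
x <- c("incorrect", "wrongly")

exact    <- any(c("wrong", "correct") %in% x)    # whole-value comparison
partial  <- any(grepl("wrong|correct", x))       # pattern found anywhere in the value
anchored <- any(grepl("^(wrong|correct)$", x))   # regex restricted to whole values

print(c(exact = exact, partial = partial, anchored = anchored))
```

Here `exact` and `anchored` are FALSE while `partial` is TRUE, so the `grepl` variant would also keep a column that only contains values such as "incorrect". Anchor the pattern if that is not what you want.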

---- update ----- Thank you, Colonel Bevel. What an elegant solution. I will use Filter more.

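I also wanted to check the speed of the solutions, in case time is an important factor. First, an apply-based version:

    # TRUE wherever a cell matches "correct" or "wrong"
    locator <- apply(df1, 2, function(x) grepl("correct|wrong", x))
    # TRUE for each column with at least one match
    index <- apply(locator, 2, any)
    # keep the matching columns
    newdf <- df1[, index, drop = FALSE]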

I have expanded your data frame to 500,000 columns:

 dftest <- as.data.frame(replicate(500000, df1[,1])) 

Then I measured the system time for the apply version, Filter with grepl, and Filter with %in%:

    f <- function() {
      locator <- apply(dftest, 2, function(x) grepl("correct|wrong", x))
      index <- apply(locator, 2, any)
      # note: !index drops the matching columns; use index to keep them
      newdf <- dftest[,!index]
    }
    f1 <- function() {newdf <- (Filter(function(x) any(c("wrong", "correct") %in% x), dftest))}
    f2 <- function() {newdf <- Filter(function(u) any(grepl('wrong|correct',u)), dftest)}

    system.time(f())
       user  system elapsed
      24.32    0.00   24.35

    system.time(f1())
       user  system elapsed
       2.31    0.00    2.34

    system.time(f2())
       user  system elapsed
       8.66    0.01    8.71

The Colonel’s solution is by far the best. It is clean and it is the fastest. Credit to @akrun for the data.frame suggestion.
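The question also asked whether dplyr can do this, which neither answer covers. A minimal sketch, assuming dplyr 1.0.0 or later is installed (needed for `where()` inside `select()`). The variable name `wanted` is only illustrative, and this variant was not part of the timings above.

```r
library(dplyr)

df1 <- data.frame(
  a = c("correct", "wrong", "wrong", "correct"),
  b = c(1, 2, 3, 4),
  c = c("wrong", "wrong", "wrong", "wrong"),
  d = c(2, 2, 3, 4)
)

# Values that mark a column as one to keep (illustrative name).
wanted <- c("correct", "wrong")

# where() applies the predicate to each column and selects
# the columns for which it returns TRUE.
df2 <- select(df1, where(function(col) any(col %in% wanted)))

names(df2)
```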


Source: https://habr.com/ru/post/985942/

