How to select specific columns containing specific lines / characters?

Question

I have this data frame:

    df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                      b = c(1, 2, 3, 4),
                      c = c("wrong", "wrong", "wrong", "wrong"),
                      d = c(2, 2, 3, 4))

          a b     c d
    correct 1 wrong 2
      wrong 2 wrong 2
      wrong 3 wrong 3
    correct 4 wrong 4

and would like to select only the columns that contain the values “correct” or “wrong” (that is, columns a and c in df1), so that I get this data frame:

    df2 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                      c = c("wrong", "wrong", "wrong", "wrong"))

            a     c
    1 correct wrong
    2   wrong wrong
    3   wrong wrong
    4 correct wrong

Can dplyr be used for this? If not, which functions can I use? The example I gave is simple, since I could just do this with dplyr:

 select(df1, a, c)

However, in my actual data frame, I have about 700 variables / columns, several hundred of which contain the values “correct” or “wrong”, and I don't know the names of those variables / columns.

Any suggestions on how to do this quickly? Thank you very much!

+6

r dataframe dplyr

hsl Apr 25 '15 at 12:47

2 answers

---- update ----- Thank you, Colonel Beauvel. What an elegant solution. I will use Filter more.

I also wanted to check the speed of the solutions, in case time is an important factor. My own approach used apply:

    locator <- apply(df1, 2, function(x) grepl("correct|wrong", x))
    index <- apply(locator, 2, any)
    newdf <- df1[, index, drop = FALSE]  # keep the columns that matched

I have expanded your data frame to 500,000 columns:

 dftest <- as.data.frame(replicate(500000, df1[,1]))

Then I measured the system time for the apply function (f), Filter with %in% (f1), and Filter with grepl (f2):

    f <- function() {
      locator <- apply(dftest, 2, function(x) grepl("correct|wrong", x))
      index <- apply(locator, 2, any)
      newdf <- dftest[, index, drop = FALSE]
    }
    f1 <- function() {newdf <- Filter(function(x) any(c("wrong", "correct") %in% x), dftest)}
    f2 <- function() {newdf <- Filter(function(u) any(grepl('wrong|correct', u)), dftest)}

    system.time(f())
       user  system elapsed
      24.32    0.00   24.35

    system.time(f1())
       user  system elapsed
       2.31    0.00    2.34

    system.time(f2())
       user  system elapsed
       8.66    0.01    8.71
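For anyone who wants to rerun this comparison without waiting on 500,000 columns, here is a smaller sketch. The column count (2,000), the extra numeric column `num`, and the function names are my own assumptions and not part of the benchmark above. Timings depend on the machine, so none are shown; the sketch only confirms that the three approaches select the same columns.

```r
df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                  b = c(1, 2, 3, 4),
                  stringsAsFactors = FALSE)

# Assumed test size: 2,000 copies of column a, plus one numeric column to drop
dftest <- as.data.frame(replicate(2000, df1[, 1]), stringsAsFactors = FALSE)
dftest$num <- df1$b

# apply-based approach: build a logical matrix, then reduce it per column
f_apply <- function(d) {
  locator <- apply(d, 2, function(x) grepl("correct|wrong", x))
  index <- apply(locator, 2, any)
  d[, index, drop = FALSE]
}

# Filter with exact matching
f_in <- function(d) Filter(function(x) any(c("wrong", "correct") %in% x), d)

# Filter with regular-expression matching
f_grepl <- function(d) Filter(function(u) any(grepl("wrong|correct", u)), d)

# Elapsed times (machine dependent)
print(system.time(res_apply <- f_apply(dftest)))
print(system.time(res_in <- f_in(dftest)))
print(system.time(res_grepl <- f_grepl(dftest)))

# All three should keep the same columns
same_columns <- identical(names(res_apply), names(res_in)) &&
  identical(names(res_in), names(res_grepl))
```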

The Colonel’s solution is by far the best. It is clean and performs best. Credit to @akrun for the data.frame expansion.

+2

Pierre lafortune Apr 25 '15 at 14:39

Colonel beauvel · Accepted Answer · 2015-04-25T12:51:34+0000

You can use base R's Filter, which applies a function to each column of df1 and keeps every column for which the logical test in the function returns TRUE:

    Filter(function(u) any(c('wrong','correct') %in% u), df1)
    #         a     c
    # 1 correct wrong
    # 2   wrong wrong
    # 3   wrong wrong
    # 4 correct wrong
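To make the column-by-column test explicit, the same selection can be sketched with sapply and logical indexing. The names `targets` and `keep` are illustrative. Since the question asked about dplyr, a dplyr variant is included as well; it assumes dplyr 1.0.0 or later (for the `where()` selection helper) and only runs if dplyr is installed.

```r
df1 <- data.frame(a = c("correct", "wrong", "wrong", "correct"),
                  b = c(1, 2, 3, 4),
                  c = c("wrong", "wrong", "wrong", "wrong"),
                  d = c(2, 2, 3, 4),
                  stringsAsFactors = FALSE)

targets <- c("correct", "wrong")

# One TRUE/FALSE per column: does the column contain any target value?
keep <- sapply(df1, function(col) any(targets %in% col))

# Logical column indexing; drop = FALSE keeps a data frame even for one column
df2 <- df1[, keep, drop = FALSE]

# dplyr variant (assumption: dplyr >= 1.0.0 is installed)
if (requireNamespace("dplyr", quietly = TRUE)) {
  df2_dplyr <- dplyr::select(df1, where(function(col) any(targets %in% col)))
}
```

Neither variant needs the column names in advance, which is the point when there are around 700 columns.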

You can also use grepl:

 Filter(function(u) any(grepl('wrong|correct',u)), df1)
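One thing to keep in mind when choosing between the two: %in% compares whole cell values, while grepl matches substrings, so the two can select different columns. A small sketch with made-up data (the data frame `df` and its values are hypothetical and only serve to show the difference):

```r
# Hypothetical data: "incorrect" contains the substring "correct"
df <- data.frame(x = c("incorrect", "n/a"),
                 y = c("correct", "wrong"),
                 z = c(1, 2),
                 stringsAsFactors = FALSE)

# Exact match on whole values
exact <- Filter(function(u) any(c("wrong", "correct") %in% u), df)

# Substring match: also picks up "incorrect"
partial <- Filter(function(u) any(grepl("wrong|correct", u)), df)

# Anchored pattern: grepl restricted to whole values again
anchored <- Filter(function(u) any(grepl("^(wrong|correct)$", u)), df)

names(exact)     # "y"
names(partial)   # "x" "y"
names(anchored)  # "y"
```

If the cells can only ever hold the exact words, both versions give the same result and the %in% form avoids regular expressions entirely.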
