R efficient data filtering (user-defined variable filters)

Question


I want to filter a data frame, original.data, in R. It can hold about 1-2 million rows. The data frame has several fields, and their names may differ between data sets. The user chooses which fields to filter on; those field names are stored in names(all.filters), where all.filters is a list of variable length. The user then selects the levels to keep for each field in names(all.filters). For example, the list might look like this:

    > all.filters
    $Period
    [1] "2010-12-31" "2011-03-31" "2011-06-30" "2011-09-30" "2011-12-31"
    [6] "2012-03-31" "2012-06-30" "2012-09-30"

    $Size
    [1] "L"  "VL"

    $Number
    [1] "11" "21" "35" "42" "45" "47" "49" "52" "57"

I use the following code to apply the selected filters:

    attach(original.data)
    filter.names <- names(all.filters)
    flag <- 1
    for(filter in filter.names){
      flag <- flag*(is.element(get(filter),all.filters[[filter]]))
    }
    filtered.data <- original.data[flag==1,]

It works, but it feels a little slow. Note that get(filter) retrieves the column of original.data whose name equals filter. I'm not sure this is a good way to filter the data, but the variable nature of all.filters limits my options a bit: I wanted to use subset, but I'm not sure what to use as the select argument. I would like this filtering step to be as fast as possible, so that the data can be rebuilt quickly whenever the user updates the filter selection.

Once the data is filtered, I use reshape2 to summarize it before plotting with ggplot2. It might be more efficient to apply the filters at one of these steps, if that is possible.

Any suggestions are welcome.

+4

performance r filtering

josh Oct 24 '12 at 22:11


3 answers

A slightly more general approach that does not depend on hard-coding field names. Suppose your data.frame and your filters have the same columns/fields in the same order:

    foo <- data.frame(Period=sample(x=c("2010-12-31","2011-01-01"),size=100,replace=TRUE),
                      Size=sample(x=c("S","L","VL"),size=100,replace=TRUE),
                      Number=sample(x=c("9","11","21"),size=100,replace=TRUE))
    all.filters <- list(Period=c("2010-12-31","2011-03-31"),
                        Size=c("L","VL"),
                        Number=c("11","21","35"))

Then we need to apply %in% to the first column of foo against the first filter entry, to the second column against the second entry, and so on:

 bar <- mapply(FUN='%in%',foo,all.filters)

Finally, we extract those rows from foo where all the filters match:

 foo[apply(bar,1,all),]
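If the data frame has extra columns, or the filters are not in column order, the same idea can be matched by name instead of by position. The following is only a sketch under that assumption; the helper name filter_by_name and the toy data are made up for illustration and are not part of the answer above:

```r
# Hypothetical helper: keep rows where every filtered column matches
# one of its allowed values. Columns are looked up by name.
filter_by_name <- function(data, filters) {
  # One logical vector per filter
  hits <- Map(function(col, allowed) data[[col]] %in% allowed,
              names(filters), filters)
  # Combine with AND; starting from all-TRUE means an empty
  # filter list keeps every row
  keep <- Reduce(`&`, hits, rep(TRUE, nrow(data)))
  data[keep, , drop = FALSE]
}

# Toy data: an extra column and filters given in a different order
foo <- data.frame(
  Period = c("2010-12-31", "2011-01-01", "2010-12-31"),
  Size   = c("L", "VL", "S"),
  Number = c("11", "9", "21"),
  other  = c(1.5, 2.5, 3.5)
)
all.filters <- list(Size = c("L", "VL"),
                    Period = c("2010-12-31", "2011-03-31"))

filter_by_name(foo, all.filters)
```

Only the first row passes both filters here. This version assumes every filter name exists as a column in the data.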

+1

Stephan Kolassa Oct 24 '12 at 10:38


It looks like you need data matching ANY of the filter options? So "L" or "VL", regardless of the period, for example?

In this case, I would simply do:

    Filtered.Data <- subset(original.data,
                            Period %in% all.filters$Period |
                            Size %in% all.filters$Size |
                            Number %in% all.filters$Number)

This should not take much time. If you need data that matches all of these filters, replace | with &. If you have many filter categories, you could build the result with a for loop and rbind, though that is clumsy.
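For a variable number of filter categories, one way to avoid both hard-coded column names and a for/rbind loop is to build one logical vector per filter and fold them together with Reduce. This is a minimal sketch with made-up toy data standing in for original.data; the names hits, match.any and match.all are illustrative only:

```r
# Toy data standing in for original.data
original.data <- data.frame(
  Period = c("2010-12-31", "2011-03-31", "2012-01-15", "2012-02-15"),
  Size   = c("S", "L", "VL", "S"),
  stringsAsFactors = FALSE
)
all.filters <- list(Period = c("2010-12-31", "2011-03-31"),
                    Size = c("L", "VL"))

# One logical vector per filter
hits <- lapply(names(all.filters),
               function(f) original.data[[f]] %in% all.filters[[f]])

match.any <- Reduce(`|`, hits)  # row passes if at least one filter matches
match.all <- Reduce(`&`, hits)  # row passes only if every filter matches

original.data[match.any, ]
original.data[match.all, ]
```

With this toy data, match.any keeps the first three rows and match.all keeps only the second.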

0

Señor o Oct 24 '12 at 22:30


mnel · Accepted Answer · 2012-10-24T23:33:57+0000

You can use data.table with the key set to the filter columns. It will be memory efficient.

Then you can pass your list of filters as the i argument of [.data.table:

    .period <- seq(from = as.Date("2010/1/1", "%Y/%m/%d"),
                   to = as.Date("2012/1/1", "%Y/%m/%d"), by = "3 months")
    .size <- c("XS", "S", "M", "L", "XL")
    .number <- as.character(1:100)
    DF <- expand.grid(Period = .period, Size = .size, Number = .number,
                      stringsAsFactors = F)
    DF$other <- rnorm(nrow(DF))

    library(data.table)
    DT <- as.data.table(DF)
    # convert the existing column in place; assigning the 9-element .period
    # relied on recycling, which current data.table versions reject
    DT[, Period := as.IDate(Period)]
    DT
    ##           Period Size Number    other
    ##    1: 2010-01-01   XS      1  0.17947
    ##    2: 2010-04-01   XS      1  1.43252
    ##    3: 2010-07-01   XS      1 -0.97142
    ##    4: 2010-10-01   XS      1 -0.98021
    ##    5: 2011-01-01   XS      1 -0.62964
    ##   ---
    ## 4496: 2011-01-01   XL    100  0.65831
    ## 4497: 2011-04-01   XL    100 -0.45277
    ## 4498: 2011-07-01   XL    100 -0.14236
    ## 4499: 2011-10-01   XL    100 -0.02376
    ## 4500: 2012-01-01   XL    100 -0.11525

    all_filters <- list(Period = as.IDate(as.Date("2010/1/1", format = "%Y/%m/%d")),
                        Size = "L",
                        Number = c("11", "21", "35", "42", "45", "47", "49", "52", "57"))

    # key the table on the filter columns, then join on the filter list
    setkeyv(DT, names(all_filters))
    DT[all_filters]
    ##        Period Size Number   other
    ## 1: 2010-01-01    L     11  1.4122
    ## 2: 2010-01-01    L     21 -0.4923
    ## 3: 2010-01-01    L     35  1.1262
    ## 4: 2010-01-01    L     42  1.3527
    ## 5: 2010-01-01    L     45 -0.3758
    ## 6: 2010-01-01    L     47 -0.1847
    ## 7: 2010-01-01    L     49 -0.8503
    ## 8: 2010-01-01    L     52 -1.0645
    ## 9: 2010-01-01    L     57 -0.6092

The only problem I see is that you have to reset the key each time to make sure you are referencing the correct columns. In addition, you need to make sure that the filter values have the same class as the corresponding columns in the data.table; it is easier to work with character columns than with factor columns.

EDIT

To filter on several values in more than one column, use CJ. CJ is a cross join (the data.table equivalent of expand.grid, with the key set):

    all_filters <- list(Period = as.IDate(as.Date("2010/1/1", format = "%Y/%m/%d")),
                        Size = c("L",'XL'),
                        Number = c("11", "21", "35", "42", "45", "47", "49", "52", "57"))
    cj_filter <- do.call(CJ, all_filters)
    # note you could avoid this `do.call` line by
    # cj_filter <- CJ(Period = as.IDate(as.Date("2010/1/1", format = "%Y/%m/%d")),
    #                 Size = c("L",'XL'),
    #                 Number = c("11", "21", "35", "42", "45", "47", "49", "52", "57"))
    setkeyv(DT, names(cj_filter))
    DT[cj_filter]
            Period Size Number       other
     1: 2010-01-01    L     11  0.36289104
     2: 2010-01-01    L     21  1.26356767
     3: 2010-01-01    L     35 -0.18629723
     4: 2010-01-01    L     42  0.92267902
     5: 2010-01-01    L     45  1.68796072
     6: 2010-01-01    L     47  1.75107447
     7: 2010-01-01    L     49  0.24048407
     8: 2010-01-01    L     52  0.06675221
     9: 2010-01-01    L     57  0.49665392
    10: 2010-01-01   XL     11  0.33682495
    11: 2010-01-01   XL     21  0.67642271
    12: 2010-01-01   XL     35 -0.16412768
    13: 2010-01-01   XL     42  0.72863394
    14: 2010-01-01   XL     45 -0.55527588
    15: 2010-01-01   XL     47  1.30850591
    16: 2010-01-01   XL     49  1.08688166
    17: 2010-01-01   XL     52 -0.31157250
    18: 2010-01-01   XL     57  0.43626422

You can also do

    setkeyv(DT, names(all_filters))
    DT[do.call(CJ,all_filters)]
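To see what the cross join contributes, here is a rough base R analogue: expand.grid builds every combination of the allowed values, and an inner join keeps only the rows that match one of them. This is a sketch for illustration, not the data.table method itself, and it will not have data.table's keyed-join performance; the toy DF and the name combos are assumptions:

```r
all_filters <- list(Size = c("L", "XL"), Number = c("11", "21", "35"))

# Every combination of the allowed values: 2 sizes x 3 numbers = 6 rows
combos <- expand.grid(all_filters, stringsAsFactors = FALSE)

# Toy data with one matching and two non-matching rows
DF <- data.frame(Size = c("L", "M", "XL"),
                 Number = c("21", "11", "99"),
                 other = c(0.1, 0.2, 0.3),
                 stringsAsFactors = FALSE)

# Inner join on the filter columns keeps rows matching some combination
res <- merge(DF, combos, by = names(all_filters))
res
```

Only the row with Size "L" and Number "21" survives, because it is the only one whose pair of values appears in combos.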
