R conditional grouping in data.table

In R, I have a fairly big data table. For each row, I want to count the rows whose value of x1 is the same (+/- some tolerance, tol). I can get this to work with adply, but it's too slow. It seems like a job for data.table - in fact, I already use data.table for part of the calculation.

Is there a way to do this completely with data.table? Here is an example:

    library(data.table)
    library(plyr)

    my.df = data.table(x1 = 1:1000, x2 = 4:1003)
    tol = 3

    adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])

Results:

            x1   x2 V1
       1:    1    4  3
       2:    2    5  4
       3:    3    6  5
       4:    4    7  5
       5:    5    8  5
      ---
     996:  996  999  5
     997:  997 1000  5
     998:  998 1001  5
     999:  999 1002  4
    1000: 1000 1003  3

Update:

Here is an example dataset that is a bit closer to my real data:

    set.seed(10)
    x = seq(1, 100000000, 100000)
    x = x + sample(1:50000, length(x), replace = T)
    x2 = x + sample(1:50000, length(x), replace = T)
    my.df = data.table(x1 = x, x2 = x2)
    setkey(my.df, x1)
    tol = 100000

    og = function(my.df) {
      adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])
    }

    # ed() and ar() are the solutions from @eddi's and @Arun's answers below
    microbenchmark(r_ed <- ed(copy(my.df)),
                   r_ar <- ar(copy(my.df)),
                   r_og <- og(copy(my.df)),
                   times = 1)

    Unit: milliseconds
                       expr         min          lq      median          uq         max neval
    r_ed <- ed(copy(my.df))    8.553137    8.553137    8.553137    8.553137    8.553137     1
    r_ar <- ar(copy(my.df))   10.229438   10.229438   10.229438   10.229438   10.229438     1
    r_og <- og(copy(my.df)) 1424.472844 1424.472844 1424.472844 1424.472844 1424.472844     1

Obviously, the solutions from @eddi and @Arun are much faster than mine. Now I just need to work on understanding how they do it.

+6

4 answers

Here's a faster data.table solution. The idea is to use data.table's rolling joins (merges), but before we can do that we need to change the data a bit and make the x1 column numeric instead of integer. This is because the OP uses a strict inequality, and to use rolling joins we will have to reduce the tolerance by a tiny amount, which turns it into a floating-point number.

    my.df[, x1 := as.numeric(x1)]

    # set the key to x1, both for the merges and to sort
    # (note: if the data is already sorted, this step can be made instantaneous using setattr)
    setkey(my.df, x1)

    # now we do two rolling merges, one against the upper bound and one against the lower,
    # then take the index of each match and subtract the two ends (+1, to get the count)
    my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
                   my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]

    # and here is the benchmark vs @Arun's solution
    ed = function(my.df) {
      my.df[, x1 := as.numeric(x1)]
      setkey(my.df, x1)
      my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
                     my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]
    }

    microbenchmark(ed(copy(my.df)), ar(copy(my.df)))
    #Unit: milliseconds
    #            expr       min       lq   median       uq      max neval
    # ed(copy(my.df))  7.297928 10.09947 10.87561 11.80083 23.05907   100
    # ar(copy(my.df)) 10.825521 15.38151 16.36115 18.15350 21.98761   100

Note: as Arun and Matthew pointed out, if x1 is an integer, there is no need to convert it to numeric and subtract a small amount from tol - just use tol - 1L instead of tol - 1e-6 above.
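To make that note concrete, here is a minimal sketch of the integer-only variant (my own adaptation of the code above, not part of the original answer); it assumes x1 and tol are both stored as integers, as in the first toy example:

    # sketch: integer-only variant of the rolling-join trick (an adaptation, not the
    # answer's own code); assumes x1 and tol are integers
    ed_int = function(my.df, tol = 3L) {
      setkey(my.df, x1)   # x1 stays integer, no as.numeric() needed
      # strict bounds on integers: x1 < y + tol  <=>  x1 <= y + tol - 1L
      #                            x1 > y - tol  <=>  x1 >= y - tol + 1L
      my.df[, res := my.df[J(x1 + tol - 1L), list(ind = .I), roll = Inf]$ind -
                     my.df[J(x1 - tol + 1L), list(ind = .I), roll = -Inf]$ind + 1L]
    }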

+4

See @eddi's answer for a quicker solution (to this specific problem). It also works when x1 is not an integer.

The algorithm you are looking for is an interval tree, and there is a Bioconductor package called IRanges that does exactly this job. It's hard to beat.

    require(IRanges)
    require(data.table)

    my.df[, res := countOverlaps(IRanges(my.df$x1, width = 1),
                                 IRanges(my.df$x1 - tol + 1, my.df$x1 + tol - 1))]

Some explanation:

Breaking the code down, it can be written in three lines:

    ir1 <- IRanges(my.df$x1, width = 1)
    ir2 <- IRanges(my.df$x1 - tol + 1, my.df$x1 + tol - 1)
    cnt <- countOverlaps(ir1, ir2)

Essentially, we create two "ranges" objects (just type ir1 and ir2 at the prompt to see what they look like). Then, for each entry in ir1, we ask how many entries in ir2 it overlaps (this is the interval-tree part), and this is very efficient. The type argument of countOverlaps defaults to type = "any"; you can explore the other types if you want - they are extremely useful. The findOverlaps function is also relevant.
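As a hedged illustration of that last point (my own sketch, not part of the answer), the same counts can be rebuilt from findOverlaps by tabulating the query hits:

    # sketch only: recover the counts from findOverlaps (assumes ir1/ir2 from above)
    hits <- findOverlaps(ir1, ir2, type = "any")               # type = "any" is the default
    cnt2 <- tabulate(queryHits(hits), nbins = length(ir1))     # hits per ir1 entry
    # cnt2 should match countOverlaps(ir1, ir2)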

Note: there may be faster solutions (in fact there is one, see @eddi's answer) for this particular case, where the width of ir1 is 1. But for problems where the widths are variable and/or greater than 1, this should be the fastest.
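To make the variable-width case concrete, here is a small hypothetical sketch (my own addition, reusing the x2 column from the first toy dataset, where x2 = x1 + 3 so end >= start, purely for illustration):

    # hypothetical sketch: variable-width intervals [x1, x2]; count, for each interval,
    # how many intervals in the same set it overlaps (n_overlap is an illustrative column)
    ir <- IRanges(start = my.df$x1, end = my.df$x2)
    my.df[, n_overlap := countOverlaps(ir, ir)]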


Benchmarking:

    ag <- function(my.df) my.df[, res := sum(abs(my.df$x1 - x1) < tol), by = x1]

    ro <- function(my.df) {
      my.df[, res := {
        y = my.df$x1
        sum(y > (x1 - tol) & y < (x1 + tol))
      }, by = x1]
    }

    ar <- function(my.df) {
      my.df[, res := countOverlaps(IRanges(my.df$x1, width = 1),
                                   IRanges(my.df$x1 - tol + 1, my.df$x1 + tol - 1))]
    }

    require(microbenchmark)
    microbenchmark(r1 <- ag(copy(my.df)),
                   r2 <- ro(copy(my.df)),
                   r3 <- ar(copy(my.df)), times = 100)

    Unit: milliseconds
                     expr      min       lq   median       uq       max neval
    r1 <- ag(copy(my.df)) 33.15940 39.63531 41.61555 44.56616 208.99067   100
    r2 <- ro(copy(my.df)) 69.35311 76.66642 80.23917 84.67419 344.82031   100
    r3 <- ar(copy(my.df)) 11.22027 12.14113 13.21196 14.72830  48.61417   100  <~~~

    identical(r1, r2) # TRUE
    identical(r1, r3) # TRUE
+9

Here is a clean data.table solution:

    my.df[, res := sum(my.df$x1 > (x1 - tol) & my.df$x1 < (x1 + tol)), by = x1]

    # check against the plyr approach from the question
    plyr.res <- adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])
    identical(my.df[, res], plyr.res$V1)
    #[1] TRUE

However, this will be relatively slow if you have many unique x1 values; you end up making a huge number of comparisons, and I can't think of a way around that at the moment.
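For what it's worth, here is a hedged base-R sketch (my own addition, not part of this answer) of one way to sidestep the pairwise comparisons, in the same spirit as the rolling-join answer above; it assumes x1 is sorted (e.g. after setkey) and needs R >= 3.3 for left.open:

    # sketch: counts via findInterval on sorted x1 (an assumption-laden illustration)
    fi <- function(my.df) {
      setkey(my.df, x1)                 # make sure x1 is sorted
      xs <- my.df$x1
      # (# of x1 strictly below y + tol) minus (# of x1 at or below y - tol)
      my.df[, res := findInterval(xs + tol, xs, left.open = TRUE) -
                     findInterval(xs - tol, xs)]
    }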

+2

Using the fact that

    abs(x - y) < tol  <=>  y - tol < x < y + tol

you can roughly double the performance.

    ## wrap the code in 2 functions for benchmarking
    library(data.table)
    library(plyr)

    my.df = data.table(x1 = 1:1000, x2 = 4:1003)
    tol = 3

    ag <- function() my.df[, res := sum(abs(my.df$x1 - x1) < tol), by = x1]

    ro <- function() my.df[, res := {
      y = my.df$x1
      sum(y > (x1 - tol) & y < (x1 + tol))
    }, by = x1]

    ## check equal results
    identical(ag(), ro())
    # TRUE

    library(microbenchmark)

    ## benchmarks
    microbenchmark(ag(), ro(), times = 1)

    Unit: milliseconds
    expr      min       lq   median       uq      max neval
    ag() 32.75638 32.75638 32.75638 32.75638 32.75638     1
    ro() 63.50043 63.50043 63.50043 63.50043 63.50043     1
+2
