Why is `DT1[DT2][, value1 - value]` faster than `DT1[DT2, value1 - value]` on a data.table with fewer columns?

This is related to this question (Can I access duplicate column names in `j` in a data.table join?), which was asked because I had assumed the opposite was true.

data.table with two columns:

Suppose you want to join two data.tables and then perform a simple operation on two related columns. This can be done in either one or two calls to `[`:

    N = 1000000
    DT1 = data.table(name = 1:N, value = rnorm(N))
    DT2 = data.table(name = 1:N, value1 = rnorm(N))
    setkey(DT1, name)
    system.time({x = DT1[DT2, value1 - value]})    # one step
    system.time({x = DT1[DT2][, value1 - value]})  # two steps

It turns out that making two calls, doing the join first and then the subtraction, is noticeably faster than doing it all in one step.

    > system.time({x = DT1[DT2, value1 - value]})
       user  system elapsed
       0.67    0.00    0.67
    > system.time({x = DT1[DT2][, value1 - value]})
       user  system elapsed
       0.14    0.01    0.16
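For what it's worth, only the timing differs here, not the result: both forms should produce the same numeric vector in the same order (DT2's row order). A quick self-contained check, assuming the same setup as above:

```r
library(data.table)

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)

one_step = DT1[DT2, value1 - value]    # j evaluated as part of the join
two_step = DT1[DT2][, value1 - value]  # join first, then j once on the result

identical(one_step, two_step)  # should be TRUE: same values, same order
```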

Why is this?

data.table with many columns:

If you put a lot of columns in the data.table, you will eventually find that the one-step approach is faster, presumably because data.table only uses the columns you reference in j.

    N = 1000000
    DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
    DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
    setkey(DT1, name)
    > system.time({x = DT1[DT2, value1 - value]})
       user  system elapsed
       0.89    0.02    0.90
    > system.time({x = DT1[DT2][, value1 - value]})
       user  system elapsed
       1.64    0.16    1.81
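One way to see why the two-step form gets expensive on wide tables (a quick check, not from the question itself): the intermediate result of `DT1[DT2]` materialises every column from both tables, whereas the one-step j only needs `value` and `value1`. A smaller sketch of the same setup:

```r
library(data.table)

N = 1000
DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi]
DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi]
setkey(DT1, name)

# The plain join carries all columns from both tables (the letter
# columns exist in both, so duplicates are kept too), all of which
# must be allocated and copied before j ever runs in the two-step form.
ncol(DT1[DT2])

# The one-step form never builds that wide intermediate table.
head(DT1[DT2, value1 - value])
```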
1 answer

I think this is due to the repeated subsetting that DT1[DT2, value1 - value] performs for each name in DT2. That is, j is evaluated once for every value of i here, rather than once after the join. With 1e6 unique values this gets quite expensive: the per-call overhead of [.data.table itself becomes significant and noticeable.

    DT1[DT2][, value1 - value]  # join once, then evaluate j once
    DT1[DT2, value1 - value]    # evaluate j for each matching group

In the first case, DT1[DT2], you join first, and that is very fast. Of course, with a lot of columns, as you show, the difference grows, but the point is that you join only once. In the second case, you group DT1 by each name in DT2, and for each group you compute the difference. That is, you subset DT1 once for every value of DT2's name: one "j" evaluation per subset! You can see this better by simply profiling it:

    Rprof()
    t1 <- DT1[DT2, value1 - value]
    Rprof(NULL)
    summaryRprof()
    # $by.self
    #                self.time self.pct total.time total.pct
    # "[.data.table"      0.96    97.96       0.98    100.00
    # "-"                 0.02     2.04       0.02      2.04

    Rprof()
    t2 <- DT1[DT2][, value1 - value]
    Rprof(NULL)
    summaryRprof()
    # $by.self
    #                self.time self.pct total.time total.pct
    # "[.data.table"      0.22    84.62       0.26    100.00
    # "-"                 0.02     7.69       0.02      7.69
    # "is.unsorted"       0.02     7.69       0.02      7.69

This repeated-subsetting overhead is eventually outweighed when you have many columns: at that point the join itself becomes the dominant, time-consuming operation, and the one-step form pulls ahead. You can verify this yourself by profiling the wide-table code in the same way.
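A side note, not from the answer itself: the per-i evaluation described above is data.table's old "by-without-by" behaviour. In data.table 1.9.4 and later, j in a join is evaluated once over the join result by default, and the grouped behaviour must be requested explicitly with `by = .EACHI`, which makes the two semantics visible side by side:

```r
library(data.table)

N = 1000000
DT1 = data.table(name = 1:N, value = rnorm(N))
DT2 = data.table(name = 1:N, value1 = rnorm(N))
setkey(DT1, name)

# Modern default: j runs once over the joined rows, returning a vector
x1 = DT1[DT2, value1 - value]

# Old by-without-by behaviour, now opt-in: j runs once per row of DT2,
# returning a data.table with one group per name
x2 = DT1[DT2, value1 - value, by = .EACHI]
```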


Source: https://habr.com/ru/post/949728/

