This is related to this question (Can I access duplicate column names in `j` in a data.table join?), which I asked because I had assumed that the opposite of what I describe here was true.
data.table with two columns:
Suppose you want to join two data.tables and then perform a simple operation on two of the joined columns. This can be done in either one or two calls to `[`:
    library(data.table)

    N = 1000000
    DT1 = data.table(name = 1:N, value = rnorm(N))
    DT2 = data.table(name = 1:N, value1 = rnorm(N))
    setkey(DT1, name)

    system.time({x = DT1[DT2, value1 - value]})    # One Step
    system.time({x = DT1[DT2][, value1 - value]})  # Two Step
It turns out that making two calls (doing the join first, and then doing the subtraction) is noticeably faster than doing it all at once.
    > system.time({x = DT1[DT2, value1 - value]})
       user  system elapsed
       0.67    0.00    0.67
    > system.time({x = DT1[DT2][, value1 - value]})
       user  system elapsed
       0.14    0.01    0.16
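As a sanity check that only speed differs and not the result, here is a minimal sketch on a small table. It assumes a recent data.table release, where the one-step form returns a plain vector; the names `small1`, `small2`, `one_step` and `two_step` are made up for this illustration and are not part of the benchmark above.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 10L     # deliberately tiny; an assumption so the check runs instantly
small1 <- data.table(name = 1:n, value = rnorm(n))
small2 <- data.table(name = 1:n, value1 = rnorm(n))
setkey(small1, name)

# Join and subtract in a single call to `[`.
one_step <- small1[small2, value1 - value]

# Join first, then subtract on the joined result.
two_step <- small1[small2][, value1 - value]

# Compare the two results element for element.
identical(one_step, two_step)
```

If the comparison holds on your version, the question is purely about where the time goes, not about which form is correct.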
Why is this?
data.table with many columns:
If you put a LOT of columns in the data.tables, you will eventually find that the one-step approach is faster, presumably because data.table only uses the columns that you reference in `j`.
    N = 1000000
    DT1 = data.table(name = 1:N, value = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
    DT2 = data.table(name = 1:N, value1 = rnorm(N))[, (letters) := pi][, (LETTERS) := pi][, (month.abb) := pi]
    setkey(DT1, name)

    system.time({x = DT1[DT2, value1 - value]})
    system.time({x = DT1[DT2][, value1 - value]})

    > system.time({x = DT1[DT2, value1 - value]})
       user  system elapsed
       0.89    0.02    0.90
    > system.time({x = DT1[DT2][, value1 - value]})
       user  system elapsed
       1.64    0.16    1.81
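One way to probe the guess about unused columns, offered only as a sketch and not as a confirmed explanation, is to drop the unneeded columns by hand before a two-step join and check that the result still matches the one-step form on the wide tables. The names `wide1`, `wide2`, `narrow1`, `narrow2`, `trimmed` and `direct` are hypothetical, the table is much smaller than the benchmark above so it runs quickly, and `on = "name"` is used here instead of `setkey()`.

```r
library(data.table)

set.seed(1)  # arbitrary seed, only so the sketch is reproducible
n <- 1000L   # far smaller than N above; an assumption to keep this quick
wide1 <- data.table(name = 1:n, value = rnorm(n))[, (letters) := pi]
wide2 <- data.table(name = 1:n, value1 = rnorm(n))[, (letters) := pi]

# Keep only the join column and the column needed for the subtraction.
narrow1 <- wide1[, .(name, value)]
narrow2 <- wide2[, .(name, value1)]

# Two-step form on the narrowed tables: join, then subtract.
trimmed <- narrow1[narrow2, on = "name"][, value1 - value]

# One-step form on the full-width tables, for comparison.
direct <- wide1[wide2, value1 - value, on = "name"]

# Check that both routes give the same values.
all.equal(trimmed, direct)
```

Wrapping the `trimmed` and `direct` lines in `system.time()` at the full N would show whether narrowing the tables closes the gap on your machine; no timings are claimed here.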