Trouble with multi-valued subsetting of a data.table inside a function

Question

I am trying to use function arguments to subset a data.table (and take the mean over that subset). Basically, I pass the function a single value for two of the keys and several values for the third key; this seems to confuse R, but the same operation works exactly as expected when performed outside of a function.

Here is an example that captures what I'm trying to do; it returns the wrong answer, whereas my actual code raises an error:

    library(data.table)
    set.seed(12345)
    dt<-data.table(yr=rep(2000:2005,each=20),
                   id=paste0(rep(rep(1:10,each=2),6)),
                   deg=paste0(rep(1:2,60)),
                   var=rnorm(120),
                   key=c("yr","id","deg"))

    fcn <- function(yr,ids,deg){
      dt[.(yr,ids,deg),mean(var)]
    }
    fcn(2004,paste0(1:3),"1")

This returns an answer, but it is completely wrong (more on that in a second). If I do it by hand, there is no problem:

    > fcn(2004,paste0(1:3),"1")
    [1] 0.1262586
    > dt[yr==2004&id %in% paste0(1:3)&deg=="1",mean(var)]
    [1] 0.4374115
    > dt[.(2004,paste0(1:3),"1"),mean(var)]
    [1] 0.4374115

To work out what was happening, I changed fcn to:

    fcn <- function(yr,ids,deg){
      dt[.(yr,ids,deg),]
    }

Which gives:

    > fcn(2004,paste0(1:3),"1")
           yr id deg        var
      1: 2000  1   1  0.5855288
      2: 2000  2   2 -0.4534972
      3: 2000  3   1  0.6058875
      4: 2000  1   2  0.7094660
      5: 2000  2   1 -0.1093033
     ---
    116: 2005  2   2 -1.3247553
    117: 2005  3   1  0.1410843
    118: 2005  1   2 -1.1562233
    119: 2005  2   1  0.4224185
    120: 2005  3   2 -0.5360480

Basically, fcn didn't subset at all! Why is this happening? Really frustrating.

If I pass a single id instead of three, dt is subset on the middle key only. Weird:

    > fcn(2004,"1","1")
           yr id deg        var
      1: 2000  1   1  0.5855288
      2: 2000  1   2  0.7094660
      3: 2000  1   1  0.5855288
      4: 2000  1   2  0.7094660
      5: 2000  1   1  0.5855288
     ---
    116: 2005  1   2 -1.1562233
    117: 2005  1   1  0.2239254
    118: 2005  1   2 -1.1562233
    119: 2005  1   1  0.2239254
    120: 2005  1   2 -1.1562233

But if I pass only the values for the middle key to the function, it works fine:

    fcn <- function(ids){
      dt[.(2004,ids,"1")]
    }
    > fcn(paste0(1:3))
         yr id deg        var
    1: 2004  1   1  0.6453831
    2: 2004  2   1 -0.3043691
    3: 2004  3   1  0.9712207

Final edit: the problem is solved, but it would be nice to know what exactly went wrong.

Rename the arguments:

    fcn <- function(yyr,ids,ddeg){
      dt[.(yyr,ids,ddeg),mean(var)]
    }

Something about reusing column names as variable names caused a problem, it seems, but I still don't quite understand what went wrong.

r data.table

MichaelChirico Apr 28 '15 at 21:20

1 answer

eddi · Accepted Answer · 2015-04-28T21:37:34+0000

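The problem is that you are using column names inside the i expression but expecting them to refer to the variables outside of the data.table. Inside dt[...], yr and deg resolve to the columns of dt rather than to the function arguments; ids is not a column name, which is why the middle key behaves as intended. You can either rename the arguments of your function, or build the data.table you join with outside of the call and rely on the fact that when i is a single variable name it is always looked up in the calling environment: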

    fcn <- function(yr,ids,deg){
      tmp = data.table(yr, ids, deg)
      dt[tmp, mean(var)]
    }
    fcn(2004, paste0(1:3), "1")
    #[1] 0.4374115
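The lookup rule can also be seen on a much smaller table. The following is only a minimal sketch: the table toy and the functions f_clash, f_ok and f_single are hypothetical names made up for illustration, and it assumes the behaviour reported in the question still holds in the installed data.table version.

```r
library(data.table)

# Hypothetical toy table (not the one from the question), keyed on column a
toy <- data.table(a = c(1L, 2L, 3L), v = c(10, 20, 30), key = "a")

# The argument shares its name with a column, so inside .() the symbol a
# resolves to the column toy$a and the join matches every row
f_clash <- function(a) toy[.(a), v]

# a_val is not a column of toy, so it is found in the function's environment
f_ok <- function(a_val) toy[.(a_val), v]

# Here i is a single variable name (lookup), so it is evaluated in the
# calling scope, where it was built from the argument a
f_single <- function(a) {
  lookup <- data.table(a)
  toy[lookup, v]
}

r_clash  <- f_clash(2L)
r_ok     <- f_ok(2L)
r_single <- f_single(2L)
```

If the rule works as described, r_clash should contain every value of v, while r_ok and r_single should contain only the value for a == 2.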

See FAQ 2.12-2.13.
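If the function's signature has to keep the argument names yr and deg, a variant of the renaming approach is to copy the arguments into local variables whose names are not columns of dt. This is a sketch under that assumption; yr_val and deg_val are hypothetical names chosen here, and the data are rebuilt exactly as in the question.

```r
library(data.table)

# Same data as in the question
set.seed(12345)
dt <- data.table(yr = rep(2000:2005, each = 20),
                 id = paste0(rep(rep(1:10, each = 2), 6)),
                 deg = paste0(rep(1:2, 60)),
                 var = rnorm(120),
                 key = c("yr", "id", "deg"))

fcn <- function(yr, ids, deg) {
  # Local aliases (hypothetical names) that do not clash with any column,
  # so the i expression picks them up from the function's environment
  yr_val  <- yr
  deg_val <- deg
  dt[.(yr_val, ids, deg_val), mean(var)]
}

res_fcn <- fcn(2004, paste0(1:3), "1")

# Reference value computed without the function, as in the question
res_manual <- dt[yr == 2004 & id %in% paste0(1:3) & deg == "1", mean(var)]
```

Comparing res_fcn with res_manual is a quick way to check that the function now subsets on all three keys.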
