I am trying to use function arguments to subset a data.table (and take the mean over that subset). Basically, I pass the function a single value for two of the keys and several values for the third key. This seems to confuse R, but the same operation works exactly as expected when performed outside of a function environment.
Here is an example that captures what I'm trying to do; it returns the wrong answer, while my own code raises an error (text pasted below):
    library(data.table)

    set.seed(12345)
    dt<-data.table(yr=rep(2000:2005,each=20),
                   id=paste0(rep(rep(1:10,each=2),6)),
                   deg=paste0(rep(1:2,60)),
                   var=rnorm(120),
                   key=c("yr","id","deg"))

    fcn <- function(yr,ids,deg){
      dt[.(yr,ids,deg),mean(var)]
    }

    fcn(2004,paste0(1:3),"1")
This gives an answer, but it is completely wrong (more on that in a second). If I do it manually, no problem:
    > fcn(2004,paste0(1:3),"1")
    [1] 0.1262586
    > dt[yr==2004&id %in% paste0(1:3)&deg=="1",mean(var)]
    [1] 0.4374115
    > dt[.(2004,paste0(1:3),"1"),mean(var)]
    [1] 0.4374115
To figure out what was going on, I changed the fcn code to:
    fcn <- function(yr,ids,deg){
      dt[.(yr,ids,deg),]
    }
Which gives:
    > fcn(2004,paste0(1:3),"1")
           yr id deg        var
      1: 2000  1   1  0.5855288
      2: 2000  2   2 -0.4534972
      3: 2000  3   1  0.6058875
      4: 2000  1   2  0.7094660
      5: 2000  2   1 -0.1093033
     ---
    116: 2005  2   2 -1.3247553
    117: 2005  3   1  0.1410843
    118: 2005  1   2 -1.1562233
    119: 2005  2   1  0.4224185
    120: 2005  3   2 -0.5360480
Basically, fcn didn't subset at all! Why is this happening? Really frustrating.
If I pass a single id instead of three, dt subsets only on the middle key. Weird:
    > fcn(2004,"1","1")
           yr id deg        var
      1: 2000  1   1  0.5855288
      2: 2000  1   2  0.7094660
      3: 2000  1   1  0.5855288
      4: 2000  1   2  0.7094660
      5: 2000  1   1  0.5855288
     ---
    116: 2005  1   2 -1.1562233
    117: 2005  1   1  0.2239254
    118: 2005  1   2 -1.1562233
    119: 2005  1   1  0.2239254
    120: 2005  1   2 -1.1562233
But if I pass only the middle key to the function, it works fine:
    fcn <- function(ids){
      dt[.(2004,ids,"1")]
    }
    > fcn(paste0(1:3))
         yr id deg        var
    1: 2004  1   1  0.6453831
    2: 2004  2   1 -0.3043691
    3: 2004  3   1  0.9712207
Final edit: the problem is solved, but it would be nice to know what exactly went wrong:
Rename the arguments:
    fcn <- function(yyr,ids,ddeg){
      dt[.(yyr,ids,ddeg),mean(var)]
    }
Something about reusing column names as variable names caused a problem, it seems, but I still don't quite understand what went wrong.
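A possible explanation, offered as a sketch rather than a definitive diagnosis: as far as I understand data.table's scoping, symbols used inside `dt[...]` are looked up among the columns of dt first. Under that assumption, `yr` and `deg` inside `.(yr,ids,deg)` would resolve to the columns of dt rather than to the function arguments, while `ids` (which is not a column name) resolves to the argument. That would be consistent with the 120-row result above and with the version that takes only `ids` working fine. One way to keep the original argument names is to build the lookup table with ordinary R evaluation before it reaches `dt[...]`. The names `fcn_lookup` and `lookup` below are hypothetical, introduced only for this illustration.

```r
library(data.table)

set.seed(12345)
dt <- data.table(yr  = rep(2000:2005, each = 20),
                 id  = paste0(rep(rep(1:10, each = 2), 6)),
                 deg = paste0(rep(1:2, 60)),
                 var = rnorm(120),
                 key = c("yr", "id", "deg"))

# Hypothetical variant of fcn: the lookup table is built outside dt[...],
# so yr and deg refer to the function arguments here, not to columns of dt.
fcn_lookup <- function(yr, ids, deg) {
  # Length-1 arguments are recycled to the length of ids.
  lookup <- data.table(yr = yr, id = ids, deg = deg)
  # "lookup" is not a column name of dt, so it resolves to the local object.
  dt[lookup, mean(var), on = c("yr", "id", "deg")]
}

res <- fcn_lookup(2004, paste0(1:3), "1")

# Reference value from the plain logical subset used earlier in the question.
ref <- dt[yr == 2004 & id %in% paste0(1:3) & deg == "1", mean(var)]

print(all.equal(res, ref))
```

Renaming the arguments, as in the fix above, avoids the same clash; the lookup-table variant only matters if the argument names have to match the column names.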