Scoping problem when subsetting a data.table on multiple keys inside a function

I am trying to subset a data.table using function arguments (and take the mean over that subset). Basically, I pass single values for two of the keys and several values for the third key; this seems to confuse R, but the operation works exactly as expected when performed outside of a function.

Here is an example that captures what I'm trying to do; it returns the wrong answer, whereas my own code raises an error (text pasted below):

    library(data.table)

    set.seed(12345)
    dt<-data.table(yr=rep(2000:2005,each=20),
                   id=paste0(rep(rep(1:10,each=2),6)),
                   deg=paste0(rep(1:2,60)),
                   var=rnorm(120),
                   key=c("yr","id","deg"))

    fcn <- function(yr,ids,deg){
      dt[.(yr,ids,deg),mean(var)]
    }

    fcn(2004,paste0(1:3),"1")

This gives an answer, but it is plain wrong (more on that in a moment). If I do it by hand, no problem:

    > fcn(2004,paste0(1:3),"1")
    [1] 0.1262586
    > dt[yr==2004&id %in% paste0(1:3)&deg=="1",mean(var)]
    [1] 0.4374115
    > dt[.(2004,paste0(1:3),"1"),mean(var)]
    [1] 0.4374115

To figure out what was happening, I changed fcn to:

    fcn <- function(yr,ids,deg){
      dt[.(yr,ids,deg),]
    }

Which gives:

    > fcn(2004,paste0(1:3),"1")
           yr id deg        var
      1: 2000  1   1  0.5855288
      2: 2000  2   2 -0.4534972
      3: 2000  3   1  0.6058875
      4: 2000  1   2  0.7094660
      5: 2000  2   1 -0.1093033
     ---
    116: 2005  2   2 -1.3247553
    117: 2005  3   1  0.1410843
    118: 2005  1   2 -1.1562233
    119: 2005  2   1  0.4224185
    120: 2005  3   2 -0.5360480

Basically, fcn didn't subset at all! Why is this happening? It is really frustrating.

If I pass a single id instead of three, dt subsets on the middle key only. Weird:

    > fcn(2004,"1","1")
           yr id deg        var
      1: 2000  1   1  0.5855288
      2: 2000  1   2  0.7094660
      3: 2000  1   1  0.5855288
      4: 2000  1   2  0.7094660
      5: 2000  1   1  0.5855288
     ---
    116: 2005  1   2 -1.1562233
    117: 2005  1   1  0.2239254
    118: 2005  1   2 -1.1562233
    119: 2005  1   1  0.2239254
    120: 2005  1   2 -1.1562233

But if I pass only the middle key to the function, it works fine:

    fcn <- function(ids){
      dt[.(2004,ids,"1")]
    }
    > fcn(paste0(1:3))
         yr id deg        var
    1: 2004  1   1  0.6453831
    2: 2004  2   1 -0.3043691
    3: 2004  3   1  0.9712207

Final edit: the problem is solved, but it would be nice to know exactly what was going wrong:

Rename the arguments:

    fcn <- function(yyr,ids,ddeg){
      dt[.(yyr,ids,ddeg),mean(var)]
    }

Reusing column names as argument names seems to have caused the problem, but I still don't quite understand what went wrong.

1 answer

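The problem is that you are using column names inside the i-expression but expecting them to refer to the variables outside the data.table. Inside dt[...], yr and deg resolve to dt's own columns first, so the function arguments of the same name are never seen; ids is not a column, so it alone is taken from the function. You can either rename the arguments of your function, or build the data.table to join on outside the call and rely on the fact that when i is a single name, it is always looked up in the calling environment: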

    fcn <- function(yr,ids,deg){
      tmp = data.table(yr, ids, deg)
      dt[tmp, mean(var)]
    }

    fcn(2004, paste0(1:3), "1")
    #[1] 0.4374115

See FAQ 2.12-2.13.
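To make the lookup order concrete, here is a minimal sketch that spells out by hand what the original function effectively evaluated. The names fcn_bad, fcn_ok, by_hand and ref are made up for this illustration, and the sketch assumes your data.table version recycles the three ids against the 120-row columns the same way the output in the question shows.

```r
library(data.table)

set.seed(12345)
dt <- data.table(yr  = rep(2000:2005, each = 20),
                 id  = paste0(rep(rep(1:10, each = 2), 6)),
                 deg = paste0(rep(1:2, 60)),
                 var = rnorm(120),
                 key = c("yr", "id", "deg"))

# Original version: inside .() the names yr and deg resolve to dt's
# columns, while ids (not a column) resolves to the function argument.
fcn_bad <- function(yr, ids, deg) dt[.(yr, ids, deg), mean(var)]

# The same lookup written out explicitly, outside any function:
# two full columns plus the three ids recycled to 120 rows.
by_hand <- dt[.(dt$yr, paste0(1:3), dt$deg), mean(var)]

# Fix from the answer: build the join table first, then pass it to i
# as a single name so it is looked up in the calling environment.
fcn_ok <- function(yr, ids, deg) {
  tmp <- data.table(yr, ids, deg)
  dt[tmp, mean(var)]
}

bad  <- fcn_bad(2004, paste0(1:3), "1")
good <- fcn_ok(2004, paste0(1:3), "1")
ref  <- dt[yr == 2004 & id %in% paste0(1:3) & deg == "1", mean(var)]

isTRUE(all.equal(bad, by_hand))  # the arguments 2004 and "1" were never used
isTRUE(all.equal(good, ref))     # the fixed version matches the manual subset
```

If both comparisons come back TRUE, that would confirm that the wrong mean in the question was a mean over a 120-row join built from the columns themselves, not over the requested subset.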


Source: https://habr.com/ru/post/986189/

