R object not found if defined in function using data.table dplyr

Question

Note: The behavior described here is fixed in the dplyr development version, which you can install with devtools::install_github("hadley/dplyr").

See this minimal example; I am using dplyr v0.3.0.2 and data.table v1.9.4

    library(dplyr)
    library(data.table)

    f <- function(x, y, bad) {
      z <- data.table(x, y, key = "x")
      z2 <- z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
      z2
    }

    f(rnorm(100), rnorm(100) < 0, bad = FALSE)

When I run the above, I get

 Error in `[.data.table`(dt, , list(sum.bad = sum(y == bad)), by = vars) : object 'bad' not found

However, bad is clearly defined within the scope of the function.

If I just run this outside the function, it works

    x <- rnorm(100)
    y <- rnorm(100) < 0
    bad <- FALSE
    z <- data.table(x, y, key = "x")
    z2 <- z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
    z2

What is the problem? Is this a bug in data.table or in dplyr?


r data.table dplyr

xiaodai Jan 05 '15 at 5:28


1 answer

Mrflick · Answer 1 · 2015-01-05T07:47:37+0000

This seems to be a problem with the way dplyr sets up the environment for calling data.table. The problem appears in the dplyr:::summarise_.grouped_dt function. Currently it looks like

    function (.data, ..., .dots)
    {
        dots <- lazyeval::all_dots(.dots, ..., all_named = TRUE)
        for (i in seq_along(dots)) {
            if (identical(dots[[i]]$expr, quote(n()))) {
                dots[[i]]$expr <- quote(.N)
            }
        }
        list_call <- lazyeval::make_call(quote(list), dots)
        call <- substitute(dt[, list_call, by = vars], list(list_call = list_call$expr))
        env <- dt_env(.data, parent.frame())
        out <- eval(call, env)
        grouped_dt(out, drop_last(groups(.data)), copy = FALSE)
    }
    <environment: namespace:dplyr>

and if we debug this function and look at the trace when it is called, we see

    where 1: summarise_.grouped_dt(.data, .dots = lazyeval::lazy_dots(...))
    where 2: summarise_(.data, .dots = lazyeval::lazy_dots(...))
    where 3: summarise(., sum.bad = sum(y == bad))
    where 4: function_list[[k]](value)
    where 5: withVisible(function_list[[k]](value))
    where 6: freduce(value, `_function_list`)
    where 7: `_fseq`(`_lhs`)
    where 8: eval(expr, envir, enclos)
    where 9: eval(quote(`_fseq`(`_lhs`)), env, env)
    where 10: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
    where 11 at #3: z %>% group_by(x) %>% summarise(sum.bad = sum(y == bad))
    where 12: f(rnorm(100), rnorm(100) < 0, bad = FALSE)

So the important line is

    env <- dt_env(.data, parent.frame())

This sets up the environment that determines where to look up the variables in the call. It simply uses parent.frame(), which refers to the frame the function was called from. But since the call passes through several intermediate functions on its way from your summarise() call inside f() to this method, that does not appear to be the right parent frame. If you instead run

    env <- dt_env(.data, parent.frame(2))

in debug mode, it appears to reach the correct parent frame. So I think the problem is the jump from summarise() to summarise_(), because this

    ff <- function(x, y, bad) {
      z <- data.table(x, y, key = "x")
      z2 <- z %>% group_by(x) %>% summarise_(.dots = list(sum.bad = quote(sum(y == bad))))
      z2
    }

    ff(rnorm(100), rnorm(100) < 0, bad = FALSE)

seems to work. So really dplyr needs to set up the correct environment. The tricky part is that the correct frame differs depending on whether you call summarise() or summarise_() directly. Perhaps summarise() could adjust its call to summarise_() so that it sees the same parent.frame, by way of eval(). But I would probably file this as a bug report and let Hadley decide how to fix it. Something like

    summarise <- function(.data, ...) {
      call <- match.call()
      call <- as.call(c(as.list(call)[1:2], list(.dots = as.list(call)[-(1:2)])))
      call[[1]] <- quote(summarise_)
      eval(call, envir = parent.frame())
    }

would be a "traditional" way to do this. I'm not sure whether the lazyeval package offers a nicer way.
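The frame issue can be reproduced in isolation with base R only, without dplyr or data.table. The sketch below is an illustration under stated assumptions, not dplyr's actual code: eval_in_caller, wrapper_direct, wrapper_forward, f_direct and f_forward are hypothetical names. eval_in_caller stands in for a method that evaluates an expression in parent.frame(), and the two wrappers stand in for the two ways summarise() could hand off to summarise_().

```r
# Hypothetical stand-in for a worker that evaluates an expression in
# its caller's frame, similar in spirit to the parent.frame() use above.
eval_in_caller <- function(expr) {
  eval(expr, parent.frame())
}

# Wrapper that calls the worker directly. Inside the worker,
# parent.frame() is now this wrapper's frame, not the user's function.
wrapper_direct <- function(expr) {
  eval_in_caller(expr)
}

# Wrapper that rewrites its own call and evaluates it in its caller's
# frame, so the worker sees the same parent frame as a direct call.
wrapper_forward <- function(expr) {
  call <- match.call()
  call[[1]] <- quote(eval_in_caller)
  eval(call, envir = parent.frame())
}

# 'bad' exists only inside these functions, as in the question.
f_direct <- function() {
  bad <- FALSE
  wrapper_direct(quote(bad))
}

f_forward <- function() {
  bad <- FALSE
  wrapper_forward(quote(bad))
}

# The direct wrapper fails because 'bad' is looked up from the wrapper's
# frame; tryCatch() captures the error object instead of stopping.
direct_result <- tryCatch(f_direct(), error = function(e) e)
inherits(direct_result, "error")  # TRUE

# The forwarding wrapper finds 'bad' in f_forward's frame.
f_forward()  # FALSE
```

If bad were also defined at top level, f_direct() would silently pick up that value, which matches the observation that the original code works outside a function.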

Tested with data.table_1.9.2 and dplyr_0.3.0.2
