Handling invalid selfref in R data.table when passing by reference to a function

My group writes a lot of code using data.table, and we are sometimes bitten by the warning "Invalid .internal.selfref detected and fixed by taking a copy of the whole table ...". This behavior can break our code when a data table is passed by reference to a function, and I'm trying to figure out how to work around it.

Suppose I have a function that adds a column to a data.table as a side effect. Note that the original data table is not returned.

    foo <- function(mydt) {
      mydt[, c := c("a", "b")]
      return(123)
    }

    > x <- data.table(a = c(1, 2), b = c(3, 4))
    > foo(x)
    [1] 123
    > x
       a b c
    1: 1 3 a
    2: 2 4 b

x has been updated with a new column. This is the desired behavior.

Now suppose something happens that invalidates the internal selfref of x:

    > x <- data.table(a = c(1, 2), b = c(3, 4))
    > x[["a"]] <- c(7, 8)
    > foo(x)
    [1] 123
    Warning message:
    In `[.data.table`(mydt, , `:=`(c, c("a", "b"))) :
      Invalid .internal.selfref detected and fixed by taking a copy ...
    > x
       a b
    1: 7 3
    2: 8 4

I understand what happened (mostly). The [["a"]] construct is not data.table-friendly: x was converted to a data frame and then back to a data table, which somehow messed up the internal workings. Then, inside foo(), during the add-column-by-reference operation, this problem was detected and a copy of mydt was made; the new column "c" was added to that copy. However, the copy severed the pass-by-reference relationship between x and mydt, so the additional column is not part of x.

The foo() function will be used by different people, and it will be hard to guard against invalid internal selfref situations. Someone out there could easily do something like x[["a"]] <- ..., which would lead to invalid input. I am trying to figure out how to handle this from within foo().

So far I have this idea, at the beginning of foo():

 if(!data.table:::selfrefok(mydt)) stop("mydt is corrupt.") 

This at least lets us detect the problem, but it is not very friendly to users of foo(), because the ways in which the input can become corrupt can be quite opaque. Ideally, I would like to be able to repair the corrupt input and keep the desired behavior of foo(). But I don't see how, unless I restructure my code so that foo() returns mydt and the caller assigns it back to x in the calling scope, which is possible but not ideal. Any ideas?
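For reference, the restructuring mentioned above could look roughly like the sketch below. The name foo_returning and the shape of its return value (a list with value and dt) are hypothetical and only for illustration; the warning may still be emitted when the selfref is invalid, but the caller no longer depends on the side effect.

```r
library(data.table)

# Hypothetical variant of foo(): returns the table together with the result,
# so the caller does not rely on the by-reference side effect.
foo_returning <- function(mydt) {
  mydt[, c := c("a", "b")]       # may warn and take a copy if the selfref is invalid
  list(value = 123, dt = mydt)   # hand the (possibly copied) table back
}

x <- data.table(a = c(1, 2), b = c(3, 4))
x[["a"]] <- c(7, 8)              # the problematic assignment from the question
res <- foo_returning(x)
x <- res$dt                      # reassign in the calling scope
```

The cost is exactly the one described above: every call site has to remember to reassign x.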

2 answers

You should read the entire warning.

Then you would notice:

At an earlier point, this data.table has been copied by R (or been created manually using structure() or similar). Avoid key<-, names<- and attr<- which in R currently (and oddly) may copy the whole data.table.

[[<- is similar to names<- and attr<- in that it will create a copy.
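If the caller controls the assignment, the copy can be avoided altogether by updating the column by reference. A minimal sketch, reusing foo() from the question and the documented := and set() interfaces:

```r
library(data.table)

foo <- function(mydt) {
  mydt[, c := c("a", "b")]   # adds column "c" by reference
  return(123)
}

x <- data.table(a = c(1, 2), b = c(3, 4))
x[, a := c(7, 8)]            # update column "a" by reference instead of x[["a"]] <- ...
# equivalently: set(x, j = "a", value = c(7, 8))
foo(x)
names(x)                     # now includes "c"
```

This does not help when the assignment happens in someone else's code, which is the case the question is about.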

You can ensure the by-reference behavior by constructing the call with substitute() and then evaluating it in the parent frame:

    foo <- function(x) {
      l <- substitute(x[, c := 'a'], as.list(match.call())['x'])
      eval.parent(l)
      return(123)
    }

    xx <- data.table(a = c(1, 2), b = c(3, 4))
    xx[["a"]] <- c(7, 8)
    foo(xx)
    # [1] 123
    # Warning message: .....
    # but it now works!
    xx
    #    a b c
    # 1: 7 3 a
    # 2: 8 4 a

A warning remains, but the function works as desired.
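To see what the substitute()/match.call() step builds, here is a small base R sketch. The helper show_call is hypothetical and only returns the unevaluated call, so it runs without data.table and without xx existing:

```r
# Hypothetical helper: build the same call as in foo(), but return it
# unevaluated instead of passing it to eval.parent().
show_call <- function(x) {
  # match.call() records the caller's expression for `x` (here the symbol xx);
  # substitute() then swaps that symbol in for `x` inside the template.
  substitute(x[, c := "a"], as.list(match.call())["x"])
}

cl <- show_call(xx)
cl[[2]]   # the symbol xx, i.e. the caller's variable name rather than a local copy
```

Because the call refers to the caller's own variable, evaluating it with eval.parent() runs the := in the frame where that variable lives.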

@pteehan, great question! In my opinion, a much cleaner solution would be to restore the over-allocation at the assignment stage itself, with a warning that basically says "don't do this!".

The way to do this would be a `[[<-.data.table` method, which currently does not exist. Unless I am missing something, this would be a great addition, whose purpose is not to encourage its use but to catch such cases, direct people to the correct usage (with a warning), and at the same time restore the over-allocation.

Roughly:

    `[[<-.data.table` <- function(x, i, j, value) {
      warning("Don't do this. Use := instead.")
      call = sys.call()
      call[[1L]] = `[[<-.data.frame`
      ans = copy(eval(call, envir = parent.frame()))
    }

    foo <- function(mydt) {
      mydt[, c := c("a", "b")]
      return(123)
    }

    x <- data.table(a = c(1, 2), b = c(3, 4))
    x[["a"]] <- c(7, 8)
    # Warning message:
    # In `[[<-.data.table`(`*tmp*`, "a", value = c(7, 8)) :
    #   Don't do this. Use := instead.

    data.table:::selfrefok(x)
    # [1] 1
    foo(x)
    # [1] 123
    x
    #    a b c
    # 1: 7 3 a
    # 2: 8 4 b

Something along these lines should provide a cleaner solution, I think. Perhaps it is worth working out properly.

PS: This post explains in detail why the warning in your question occurs.

Source: https://habr.com/ru/post/972355/

