My group writes a lot of code using data.table, and we are occasionally bitten by the warning "Invalid .internal.selfref detected and fixed by taking a copy of the whole table ...". This behavior can break our code when a data.table is passed by reference to a function, and I'm trying to figure out how to work around it.
Suppose I have a function that adds a column to a data.table as a side effect. Note that the data.table itself is not returned.
    foo <- function(mydt) {
      mydt[, c := c("a", "b")]
      return(123)
    }
    > x <- data.table(a=c(1,2), b=c(3,4))
    > foo(x)
    [1] 123
    > x
       a b c
    1: 1 3 a
    2: 2 4 b
x has been updated with the new column. This is the desired behavior.
Now suppose something happens that invalidates the internal self-ref in x:
    > x <- data.table(a=c(1,2), b=c(3,4))
    > x[["a"]] <- c(7,8)
    > foo(x)
    [1] 123
    Warning message:
    In `[.data.table`(mydt, , `:=`(c, c("a", "b"))) :
      Invalid .internal.selfref detected and fixed by taking a copy ...
    > x
       a b
    1: 7 3
    2: 8 4
I understand what happened (mostly). The x[["a"]] <- ... construct is not data.table-friendly. x was converted to a data frame and then back to a data table, which somehow messed up the internal bookkeeping. Then, inside foo(), during the add-column-by-reference operation, the problem was detected and a copy of mydt was made; the new column "c" was added to that copy. However, the copy broke the pass-by-reference link between x and mydt, so the new column is not part of x.
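To make the contrast concrete, here is a minimal sketch of the same update done by reference, assuming the caller can be asked to use set() (or :=) instead of x[["a"]] <- .... The call to data.table:::selfrefok() is the unexported internal check used later in this question; it is printed for inspection only, and its return value is not something I would rely on.

```r
library(data.table)

# foo() as defined in the question: adds column "c" by reference.
foo <- function(mydt) {
  mydt[, c := c("a", "b")]
  return(123)
}

x <- data.table(a = c(1, 2), b = c(3, 4))

# Replace column "a" by reference instead of x[["a"]] <- c(7, 8).
set(x, j = "a", value = c(7, 8))   # equivalent here: x[, a := c(7, 8)]

# Internal, unexported check; shown for inspection only.
print(data.table:::selfrefok(x))

foo(x)
print(names(x))
```

With this form x is never copied before foo() is called, so the column added inside foo() should be visible in x afterwards.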
The foo() function will be used by different people, and it will be hard to guard against invalid internal self-ref situations. Someone out there can easily do something like x[["a"]] <- ..., which leads to invalid input. I am trying to figure out how to handle this from within foo().
So far I have this idea, at the beginning of foo():
if(!data.table:::selfrefok(mydt)) stop("mydt is corrupt.")
This at least lets us detect the problem, but it is not very friendly to users of foo(), because the ways in which the input can become corrupted are fairly opaque. Ideally, I would like to be able to repair the corrupt input and keep foo()'s desired behavior. But I don't see how, unless I restructure my code so that foo() returns mydt and the caller assigns it back to x, which is possible but not ideal. Any ideas?
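For reference, the restructuring I would like to avoid looks roughly like the sketch below. foo2 is a hypothetical name used only for this illustration; it returns the table so that the caller can reassign it.

```r
library(data.table)

# Hypothetical variant of foo() that returns the table instead of relying
# only on the side effect.
foo2 <- function(mydt) {
  mydt[, c := c("a", "b")]
  invisible(mydt)
}

x <- data.table(a = c(1, 2), b = c(3, 4))
x[["a"]] <- c(7, 8)   # the assignment that invalidated the self-ref above

# The caller reassigns the result, so it does not matter whether := worked
# in place or on a copy. The selfref warning may still be emitted here.
x <- foo2(x)
print(x)
```

This works regardless of the state of the self-ref, but every caller has to remember the x <- foo2(x) assignment, which is exactly the burden I was hoping to avoid.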