I am a big fan and heavy user of data.table in R. I use it in a lot of my code, but I recently encountered a strange error:
I have a huge data.table with several columns, for example:
   x y
1: 1 a
2: 1 b
3: 1 c
4: 2 a
5: 2 b
6: 2 c
7: 3 a
8: 3 b
9: 3 c
If I run
dataDT[x=='1']
then I get only
   x y
1: 1 a
whereas
dataDT[(x=='1')]
gives me
   x y
1: 1 a
2: 1 b
3: 1 c
Any ideas? x and y are the columns, and the data.table is keyed on x via setkey().
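For reference, the table above can be built like this (I am assuming x is a character column, since it is compared against '1'; this small version behaves correctly for me and only shows the exact calls):

library(data.table)

# minimal reconstruction of the table printed above (column types assumed)
dataDT <- data.table(x = as.character(rep(1:3, each = 3)),
                     y = rep(c('a', 'b', 'c'), times = 3))
setkey(dataDT, x)

dataDT[x == '1']    # on my real table this returns only one row
dataDT[(x == '1')]  # on my real table this returns all three rows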
ADDITIONAL INFORMATION AND CODE:
I have since found a fix for this problem, but it is neither clear nor intuitive.
My code is structured as follows: I have a function that is called from my main code, in which I need to add a column to the data.table.
Previously I used the notation
dataT[, nC := oC]
to do this.
I found that creating the new column with
dataT$nC <- dataT$oC
instead completely removes the error.
I tried to reproduce the same error in a simpler code example, but I cannot, possibly because it depends on the size and structure of my data.table as well as on the specific functions I run on it.
That said, I have a working example showing that when a column is inserted with the dataT[, nC := oC] notation, the table acts as if it had been passed to the function by reference, not by value.
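Here is a minimal sketch of that behaviour (the function and column names are made up for illustration, not from my real code):

library(data.table)

# hypothetical illustration: := inside a function edits the caller's table
addColByRef <- function(dt) {
  dt[, nC := oC]   # := adds the column by reference, in place
  invisible(dt)
}

addColByCopy <- function(dt) {
  dt$nC <- dt$oC   # $<- works on a copy, so the caller's table is untouched
  invisible(dt)
}

dtA <- data.table(oC = 1:3)
addColByRef(dtA)
names(dtA)   # "oC" "nC": the caller's table was modified

dtB <- data.table(oC = 1:3)
addColByCopy(dtB)
names(dtB)   # "oC": the caller's table is unchanged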
Interestingly enough, running
dataDT[x == '1']
vs
dataDT[(x == '1')]
gives the same result here, but the latter is about 10 times slower, which is what I noticed earlier. I hope this code can shed some light.
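A rough way to see the speed difference on a larger table (the size and types here are arbitrary assumptions):

library(data.table)

# rough timing sketch; 1e7 rows chosen arbitrarily
bigDT <- data.table(x = as.character(sample(1:100, 1e7, replace = TRUE)),
                    y = runif(1e7))
setkey(bigDT, x)

system.time(for (i in 1:10) bigDT[x == '1'])    # can use the key
system.time(for (i in 1:10) bigDT[(x == '1')])  # appears to bypass the key optimisation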
rm(list = ls())
library(data.table)

# split the input by a, aggregate y by x in each part, and stack the results
superParF <- function(dtInput) {
  dtInputP <- dtInput[a == 1]
  dtInputN <- dtInput[a == 2]
  outDT <- rbind(dtInputP[, sum(y), by = 'x'],
                 dtInputN[, sum(y), by = 'x'])
  return(outDT)
}

superFunction <- function(dtInput) {