A strange problem with finding data.

Question

I am a big fan and heavy user of data.table in R. I use it in a lot of my code, but I recently ran into a strange bug:

I have a huge data table with several columns, for example:

       x y
    1: 1 a
    2: 1 b
    3: 1 c
    4: 2 a
    5: 2 b
    6: 2 c
    7: 3 a
    8: 3 b
    9: 3 c

If I run

    dataDT[x=='1']

I get only

       x y
    1: 1 a

whereas

    dataDT[(x=='1')]

gives me

       x y
    1: 1 a
    2: 1 b
    3: 1 c

Any ideas? x and y are factors, and the data.table is keyed on x with setkey.
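To try the two forms side by side, here is a minimal sketch in R. It assumes a small keyed table built to mirror the example above: the name `dataDT` and the columns come from the question, but the data is made up. On a version without the bug, both forms should return the same three rows. The parenthesised form is generally understood to skip data.table's optimised key or index lookup and fall back to an ordinary vector scan, which would explain why it is not affected.

```r
library(data.table)

# Toy table mirroring the question: x and y are factors, keyed on x
dataDT <- data.table(
  x = factor(rep(1:3, each = 3)),
  y = factor(rep(c("a", "b", "c"), times = 3))
)
setkey(dataDT, x)

# Form 1: bare comparison; data.table may optimise this into a key lookup
res_fast <- dataDT[x == "1"]

# Form 2: extra parentheses; evaluated as a plain logical vector
res_scan <- dataDT[(x == "1")]

print(res_fast)
print(res_scan)
```

If the two printed results differ, the installed version is likely one affected by the subsetting bug.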

ADDITIONAL INFORMATION AND CODE:

I actually fixed this problem, but the fix is neither clear nor intuitive.

My code is structured as follows: I have a function, called from my main code, in which I need to add a column to the data.table.

Previously I used the following notation

    dataT[, nC := oC, ]

to do this.

I found that creating the new column with

    dataT$nC <- dataT$oC

instead completely fixes the error.

I tried to replicate the same error in a simpler code example, but I can't, possibly because it depends on the size or structure of my data.table, as well as on the specific functions that I run on my table.

That said, I have a working example showing that when a column is added with the dataT[, nC := oC, ] notation, the table behaves as if it had been passed to the function by reference and not by value.
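This by-reference behaviour is how `:=` is designed to work, and it can be seen in isolation. The following is a minimal sketch, assuming two hypothetical helper functions (`addByRef` and `addToCopy` are illustration names, not from the original code) and a tiny made-up table.

```r
library(data.table)

# Hypothetical helper: adds a column with :=, which modifies the caller's table in place
addByRef <- function(dt) {
  dt[, nC := oC]
  invisible(dt)
}

# Hypothetical helper: works on an explicit copy, leaving the caller's table untouched
addToCopy <- function(dt) {
  local_dt <- copy(dt)
  local_dt[, nC := oC]
  local_dt
}

dataT <- data.table(oC = 1:3)

out <- addToCopy(dataT)
has_nC_after_copy <- "nC" %in% names(dataT)  # the caller's table is unchanged

addByRef(dataT)
has_nC_after_ref <- "nC" %in% names(dataT)   # the caller's table now has nC

print(c(after_copy = has_nC_after_copy, after_ref = has_nC_after_ref))
```

Calling `copy()` inside the function is the usual way to keep `:=` from touching the caller's table, at the cost of duplicating the data.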

Interestingly, running

    dataDT[x == '1']

vs

    dataDT[(x == '1')]

gives the same result here, but the latter is 10 times slower, which I had noticed earlier. I hope this code can shed some light.

    rm(list=ls())
    library(data.table)

    superParF <- function(dtInput){
      dtInputP <- dtInput[a==1]
      dtInputN <- dtInput[a==2]
      outDT <- rbind(dtInputP[,sum(y),by='x'], dtInputN[,sum(y),by='x'])
      return(outDT)
    }

    superFunction <- function(dtInput){
      #create new column
      dtInput[,z:=y,]
      #run function
      outDT <- rbindlist(lapply(unique(inputDT$x), function(i) superParF(inputDT[x==i])))
      #output outDT
      return(outDT)
    }

    inputDT <- data.table(x = c(rep(1,100000), rep(2,100000), rep(3,100000), rep(4,100000), rep(5,100000)),
                          y= c(rep(1:100000,5)))
    inputDT$x <- as.factor(inputDT$x)
    inputDT$y <- as.numeric(inputDT$y)
    inputDT <- rbind(inputDT,inputDT)
    inputDT$a <- c(rep(1,500000),rep(2,500000))
    setkey(inputDT,x)

    #first observation-> the two searches do not work with the same performance
    a <- system.time(inputDT[x=='1'])
    b <- system.time(inputDT[(x=='1')])
    print(a)
    print(b)

    out <- superFunction(inputDT)

    a <- system.time(inputDT[x=='1'])
    b <- system.time(inputDT[(x=='1')])
    print(a)
    print(b)

    inputDT

r data.table

nbafrank Feb 12 '16 at 1:46

2 answers

Matt Dowle · Answer 1 · 2016-02-12T03:09:29+0000

I asked in the comments for the version number and for you to follow the guidance on the Support page. It says:

Read and search the README.md. Is there a bug fix or a new feature related to your issue? Possibly we knew about the problem, or someone else reported it, and we have already fixed it in the current development version.

So, searching README.md for the string "index" using just Ctrl-F in the browser gives:

21. Auto indexing handles a logical subset of a factor column using a numeric value properly, #1361. Thanks @mplatzer.
26. Auto indexing returns the order of the subset properly when the input data.table is already sorted, #1495. Thanks @huashan for the nice reproducible example.

Those are fixed in v1.9.7, which is easily installed with a single command, as detailed on the Installation page.

The first one (item 21) looks suspiciously close to your problem. So please try v1.9.7, as requested in point 4 on the Support page.

We ask you to state the version number up front to save time, because we want to be sure that you are using at least v1.9.6 from CRAN and not v1.9.4, which had this issue:

DT[column == value] no longer recycles value except in the length 1 case (when it still uses DT's key or an automatic secondary key, as introduced in v1.9.4). If length(value) == length(column), then it works element-wise as standard in R. Otherwise, a length error is issued to avoid common user errors. DT[column %in% values] still uses DT's key (or an automatic secondary key) as before. Automatic indexing (i.e. optimization of == and %in%) can be turned off with options(datatable.auto.index = FALSE).
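The behaviour in that release note can be tried on a small table. The following is a minimal sketch, assuming made-up data (`DT`, `column` and `v` are illustration names); the option name is the one quoted in the note. It runs the same query with automatic indexing on and off, so the two results can be compared.

```r
library(data.table)

# Made-up table for illustration only
DT <- data.table(column = c("a", "b", "a", "c"), v = 1:4)

# Length-1 value: eligible for the optimised (key or automatic index) lookup
one <- DT[column == "a"]

# Several values: %in% is the form that stays optimised
several <- DT[column %in% c("a", "c")]

# Turn automatic indexing off; the same query then uses a plain vector scan
old <- options(datatable.auto.index = FALSE)
one_scan <- DT[column == "a"]
options(old)  # restore the previous setting

print(one)
print(one_scan)
```

If `one` and `one_scan` ever disagree, that points at the automatic indexing path rather than at the data.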

So, which version are you running, and have you tried v1.9.7, given that it looks worth a try?
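The installed version can be checked from within R. A minimal sketch, using base R's `packageVersion()`; the variable names are only for illustration.

```r
# Report the installed data.table version
ver <- packageVersion("data.table")
print(ver)

# TRUE when at least the CRAN release mentioned above is installed
is_recent <- ver >= "1.9.6"
print(is_recent)
```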

nbafrank · Answer 2 · 2016-02-13T20:34:18+0000

Using the dT[, Column := Value] notation also seems to trigger the bug reported in another post:

data.table does not recognize a logical filter

Replacing dT[, Column := Value] with dT$Column <- Value fixes both my error and the error in that post.

@Matt Dowle: the post I am linking has much more concise code than mine, and the error is the same. It might help a lot in tracking down this bug.
