Why does data.table automatically convert column types when assigning all columns by reference

Question


This is what I don't understand about data.table: if I select a row and try to set all the values of that row to NA, the columns of the resulting data.table are converted to logical.

 # Here is a sample table
 DT <- data.table(a=rep(1L,3),b=rep(1.1,3),d=rep('aa',3))
 DT
    a   b  d
 1: 1 1.1 aa
 2: 1 1.1 aa
 3: 1 1.1 aa

 # Here I extract a row; all the column types are kept... good
 str(DT[1])
 Classes 'data.table' and 'data.frame': 1 obs. of 3 variables:
  $ a: int 1
  $ b: num 1.1
  $ d: chr "aa"
  - attr(*, ".internal.selfref")=<externalptr>

 # Now I want to set them all to NA... they all become logicals => WHY IS THAT?
 str(DT[1][,colnames(DT):=NA])
 Classes 'data.table' and 'data.frame': 1 obs. of 3 variables:
  $ a: logi NA
  $ b: logi NA
  $ d: logi NA
  - attr(*, ".internal.selfref")=<externalptr>

EDIT: I think this is a bug, because the result depends on the number of rows:

 R) str(DT[1][,a:=NA])
 Classes 'data.table' and 'data.frame': 1 obs. of 3 variables:
  $ a: logi NA
  $ b: num 1.1
  $ d: chr "aa"
  - attr(*, ".internal.selfref")=<externalptr>
 R) str(DT[1:2][,a:=NA])
 Classes 'data.table' and 'data.frame': 2 obs. of 3 variables:
  $ a: int NA NA
  $ b: num 1.1 1.1
  $ d: chr "aa" "aa"
  - attr(*, ".internal.selfref")=<externalptr>


r data.table

statquant Sep 03 '13 at 13:48


1 answer

Matt Dowle · Accepted Answer · 2013-09-04T11:03:15+0000

To provide an answer, from ?":=" :

Unlike <- for data.frame, the (potentially large) LHS is not coerced to match the type of the (often small) RHS. Instead the RHS is coerced to match the type of the LHS, if necessary. Where this involves double precision values being coerced to an integer column, a warning is given (whether or not fractional data is truncated). The motivation for this is efficiency. It is best to decide on column types up front and stick to them. Changing a column type is possible but deliberately harder: provide a whole column as the RHS. This RHS is then plonked into that column slot, and we call this plonk syntax, or replace column syntax if you prefer. By needing to construct a full-length vector of a new type, you as the user are more aware of what is happening, and it is clearer to readers of your code that you really do intend to change the column type.

The motivation for all this is large tables (say 10 GB in RAM), of course. Not 1 or 2 row tables.

Simply put: if length(RHS) == nrow(DT), then the RHS (whatever its type) is plonked into that column slot, even if those lengths are 1. If length(RHS) < nrow(DT), the memory for the column (and its type) is kept in place, and the RHS is coerced and recycled to replace the (subset of) items in that column.
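A minimal sketch of that rule, reusing the sample table from the question. The names one_row, two_rows and typed are illustrative only, and the classes noted in the comments are the ones reported in the question; the last step (using the typed constant NA_integer_) is an addition that relies on base R's typed NA values, not something the answer itself proposes.

```r
library(data.table)

DT <- data.table(a = rep(1L, 3), b = rep(1.1, 3), d = rep("aa", 3))

# length(RHS) == nrow(DT) == 1: the RHS is plonked into the column slot,
# so the column takes the type of NA, which is logical.
one_row <- DT[1][, a := NA]
class(one_row$a)   # "logical"

# length(RHS) == 1 < nrow(DT) == 2: the integer column is kept in place
# and NA is coerced to integer and recycled.
two_rows <- DT[1:2][, a := NA]
class(two_rows$a)  # "integer"

# A typed NA constant avoids the difference: even when it is plonked,
# the column stays integer.
typed <- DT[1][, a := NA_integer_]
class(typed$a)     # "integer"
```

So the one-row result in the question is the plonk case applied to a table that happens to have a single row, which is why it looks inconsistent next to the two-row result.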

If I need to change the type of a column in a large table, I write:

 DT[, col := as.numeric(col)]

Here as.numeric allocates a new vector and coerces col into that new memory, which is then plonked into the column slot. It is as efficient as it can be. The reason it is a plonk is that length(RHS) == nrow(DT).

If you want to overwrite the column with another type containing a specific default value:

 DT[, col := rep(21.5,nrow(DT))] # ie, deliberately harder

If col was type integer before, it will change to type numeric containing 21.5 in every row. Otherwise, just DT[, col := 21.5] would result in a warning about 21.5 being coerced to 21 (unless DT has only 1 row!)
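The two cases can be compared side by side with a small sketch. The three-row table and its values are made up for illustration, and the warning is suppressed only so that the script runs quietly; the exact warning text is not shown because it may differ between data.table versions.

```r
library(data.table)

DT <- data.table(col = c(1L, 2L, 3L))

# Length-1 RHS on a 3-row table: 21.5 is coerced to fit the existing
# integer column, so the column type does not change.
invisible(suppressWarnings(DT[, col := 21.5]))
class(DT$col)  # "integer"

# Full-length RHS: the new double vector replaces the column, type included.
DT[, col := rep(21.5, nrow(DT))]
class(DT$col)  # "numeric"
```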
