How to do an X[Y] data.table join without losing the existing key on X?

When joining data.tables X and Y using X[Y], X must have a key, and the key columns of Y are matched against it. If X is a very large table that is normally keyed on columns not used in the join, then the key of X has to be changed for the join and restored to the original key afterwards. Is there an efficient way to do the join without losing the original key on X?

I have a large environmental time-series dataset DT (1M rows, 36 columns), stored as a data.table keyed on the site and date columns. I need to do calculations on existing columns of DT and/or add a new column derived from an existing column using a small lookup or recoding table.

Here is a minimal example:

    require(data.table)  # using v1.9.5

    # main data table DT, keyed on site and date, with data column x
    DT <- data.table(site = rep(LETTERS[1:2], each=3),
                     date = rep(1:3, times=2),
                     x = rep(1:3*10, times=2),
                     key = "site,date")
    DT
    #    site date  x
    # 1:    A    1 10
    # 2:    A    2 20
    # 3:    A    3 30
    # 4:    B    1 10
    # 5:    B    2 20
    # 6:    B    3 30

    # lookup table for x to y lookup, keyed on x
    x2y <- data.table(x = c(10,20), y = c(100,200), key = "x")
    x2y
    #     x   y
    # 1: 10 100
    # 2: 20 200

To join the x2y lookup table with the main DT table, I set the DT key to "x":

    setkey(DT,x)

Then the join works as expected:

    DT[x2y]
    #    site date  x   y
    # 1:    A    1 10 100
    # 2:    B    1 10 100
    # 3:    A    2 20 200
    # 4:    B    2 20 200

and I can use "y" from the lookup table in calculations or to create a new column in DT:

    DT[x2y, y:=y]
    #    site date  x   y
    # 1:    A    1 10 100
    # 2:    B    1 10 100
    # 3:    A    2 20 200
    # 4:    B    2 20 200
    # 5:    A    3 30  NA
    # 6:    B    3 30  NA

But now my DT time-series dataset is keyed on "x", and I need to set the key back to "site, date" for future use.

    setkey(DT,site,date)

Is this approach (rekey X, join, then rekey X back) the fastest way to do this when DT is very large (1M rows), or is there an equally efficient way to do this kind of lookup without losing the original key on the large DT table?
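For reference, the rekey, join, restore sequence described above can be written so that the original key is captured rather than hard-coded. This is only a minimal sketch on the toy data from the example; `old_key` is an illustrative variable name, and `i.y` is used to refer explicitly to the lookup table's column. Both `setkey` calls still re-sort the table, so this does not avoid the cost the question is about.

```r
library(data.table)

# Same toy data as in the example above.
DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = rep(1:3, times = 2),
                 x    = rep(1:3 * 10, times = 2),
                 key  = "site,date")
x2y <- data.table(x = c(10, 20), y = c(100, 200), key = "x")

old_key <- key(DT)       # remember the current key: c("site", "date")
setkey(DT, x)            # re-sort DT by the join column
DT[x2y, y := i.y]        # add y by reference; unmatched rows stay NA
setkeyv(DT, old_key)     # re-sort back and restore the original key

key(DT)
DT
```

Capturing the key with `key()` and restoring it with `setkeyv()` keeps the pattern reusable for tables keyed on other columns.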

2 answers

With secondary keys implemented (since v1.9.6) and the recent bug fix for retaining/discarding keys properly (in v1.9.7), you can now do this using on=:

    # join
    DT[x2y, on="x"]        # key is removed as row order gets changed.

    # update using joins
    DT[x2y, y:=y, on="x"]  # key is retained, as row order isn't changed.
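To check the update join above on the toy data from the question, something like the following sketch can be used. It assumes a data.table release that includes `on=` and the key-retention fix mentioned in this answer; the lookup table is deliberately left unkeyed to show that `on=` does not need a key, and `i.y` is used to name the lookup column explicitly.

```r
library(data.table)  # assumes a release with on= and the key-retention fix

DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = rep(1:3, times = 2),
                 x    = rep(1:3 * 10, times = 2),
                 key  = "site,date")
x2y <- data.table(x = c(10, 20), y = c(100, 200))  # no key set on the lookup

# Update join: y is added by reference and the rows of DT are not reordered.
DT[x2y, y := i.y, on = "x"]

key(DT)   # should still report the original key columns
DT
```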

Update: Thanks to the bug fix, this is no longer required. See Accepted Answer.


I would just join on x:

    DT[,y:=x2y[J(DT$x)]$y]

The key of DT is retained here.
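As a sketch of why this works, on the toy data from the question: the roles are reversed, so the small keyed table `x2y` is the one being looked up, and `J(DT$x)` asks for one result row per row of DT, in DT's current order (with NA where there is no match). The resulting `y` vector therefore lines up with DT and can be assigned directly, without touching DT's key. The intermediate name `matched` is only illustrative.

```r
library(data.table)

DT <- data.table(site = rep(LETTERS[1:2], each = 3),
                 date = rep(1:3, times = 2),
                 x    = rep(1:3 * 10, times = 2),
                 key  = "site,date")
x2y <- data.table(x = c(10, 20), y = c(100, 200), key = "x")

# Look up every x value of DT in the keyed lookup table.
matched <- x2y[J(DT$x)]   # one row per row of DT, in the same order
nrow(matched) == nrow(DT)

# Assign the aligned y values; DT is neither rekeyed nor reordered.
DT[, y := x2y[J(DT$x)]$y]

key(DT)
DT
```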


Source: https://habr.com/ru/post/986134/

