Search for the closest value in the data table.

I am trying to find the closest value for each processed observation. The data is as follows (partial data from 1.2M obs):

> dta id treatment score 1: 5 0 0.02381024 2: 10 0 0.05428605 3: 22 0 0.02118124 4: 27 0 0.01495214 5: 45 0 0.01877916 6: 50 0 0.02120360 7: 58 0 0.02207263 8: 60 0 0.02807019 9: 61 0 0.05432927 10: 65 1 0.59612077 11: 68 0 0.02482168 12: 72 1 0.14582400 13: 73 0 0.02371670 14: 77 0 0.02608826 15: 87 0 0.06852409 16: 88 0 0.07473471 17: 94 0 0.07160314 18: 97 0 0.02040747 19: 104 1 0.09878789 20: 108 0 0.02421807 

For each processed observation (i.e. treatment = 1), I would like to receive an unprocessed observation (i.e. treatment = 0) with the closest score and mark the selected observation as unsuitable for other processed observations to match.

For example, the first processed observation (line 10) will correspond to id = 88 (line 16), line 12 - line 17, etc. I am currently running a floowing loop:

 smpl_treated = dta[treatment == 1] smpl_untreated = dta[treatment == 0] n_tmp = nrow(smpl_treated) matched_id = matrix(0, n_tmp, 1) smpl_tmp = smpl_untreated for (i in 1:nrow(smpl_treated)) { x = smpl_treated[i]$score setkey(smpl_tmp, score) tmp = smpl_tmp[J(x), roll = "nearest"] matched_id[i] = tmp[[1]] smpl_tmp = smpl_tmp[id != tmp[[1]]] } matched_smpl = smpl_untreated[id %in% matched_id] > matched_smpl id treatment score 1: 87 0 0.06852409 2: 94 0 0.07160314 3: 88 0 0.07473471 

Any suggestions to do this in the data.table or make the loop faster? When using the original 1.2 Mb, the cycle takes more than 2 hours. Thanks in advance for your help!

+5
source share
2 answers

This develops the already accepted denis answer using the actual syntax of data.table , for example, use the on parameter instead of setkey() when attaching.

 # determine the minimum number of treated and untreated cases n <- min(dta[treatment == 0L, .N], dta[treatment == 1L, .N]) # order by descending score mdt <- dta[order(-score)][ # and pick the ids of the top n treated and untreated cases # so that the highest untreated score match the highest treated score, # the 2nd highest untreated the 2nd highest treated and so forth , .(id0 = head(.SD[treatment == 0L, id], n), id1 = head(.SD[treatment == 1L, id], n))] mdt 
  id0 id1 1: 88 65 2: 94 72 3: 87 104 
 # join the ids two times to show the data of the treated and untreated cases dta[dta[mdt, on = .(id==id0)], on = .(id = id1)] 
  id treatment score i.id i.treatment i.score 1: 65 1 0.59612077 88 0 0.07473471 2: 72 1 0.14582400 94 0 0.07160314 3: 104 1 0.09878789 87 0 0.06852409 
+1
source

I may have a solution if you order your data table, compose a subset and use the merge feature. Not sure if this is the best solution, but it seems to work for me to understand what you want to do, and it will be exactly faster than your loop:

 library(data.table) dta <- data.table(id = c(5,10,22,27,45,50,58,60,61,65,68,72,73,77,87,88,94,97,104,108), treatment = c(0, 0 ,0 ,0, 0, 0, 0 ,0 , 0 , 1, 0 ,1 ,0, 0 ,0 ,0 ,0 ,0 ,1 ,0), score = c(0.02381024, 0.05428605, 0.02118124, 0.01495214, 0.01877916, 0.02120360, 0.02207263, 0.02807019, 0.05432927, 0.59612077, 0.02482168, 0.14582400, 0.02371670, 0.02608826, 0.06852409, 0.07473471, 0.07160314, 0.02040747, 0.09878789, 0.02421807)) setkey(dta, score) # order by score treated_nbr <- dta[treatment == 1, .N] # just to simplify the next line selecteddata <- dta[treatment == 0, .SD[(.N - treated_nbr + 1):.N, .(correspid = id, correspscore = score, id = dta[treatment == 1, id])]] 

here we take the same number of ordered .N-treated_nbr+1):.N persons ( .N-treated_nbr+1):.N ) so that they have the closest result to the ordered one, and we combine id with the identifier of the processed one ( id = dta[,.SD[treatment == 1,id]] )

 setkey(selecteddata, id) setkey(dta, id) selecteddata[dta] # do the merging 

Not sure if this is exactly what you want, because I realized that it only works if your processed points are higher than not processed (which is the case in your example). You can add a condition to use the solution offered only for the processed person with a bar higher than not processed ones, and do the rest otherwise (I do not see a direct simple solution otherwise)

+1
source

Source: https://habr.com/ru/post/1272814/


All Articles