This is a follow-up to this question, where the accepted answer shows an example of a matching exercise using data.table, including non-equi conditions.
Background
The basic setup is that we have DT1, with a sample of people's details, and DT2, which is a sort of database. The goal is to find out whether each person in DT1 matches at least one entry in DT2.
First, we initialize the column that indicates a match to FALSE, so that its values can be updated to TRUE whenever a match is found.
DT1[, MATCHED := FALSE]
The following general solution is then used to update the column:
DT1[, MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]
In theory, it looks like it should work perfectly. The sub-expression DT2[.SD, on=.(Criteria), .N, by=.EACHI] produces a sub-table with one row for each row of DT1 and computes the column N, which is the number of matches for that row found in DT2. Then, wherever N is greater than zero, the value of MATCHED in DT1 is updated to TRUE.
It works as expected in a trivial reproducible example. But I came across some unexpected behavior when using it with real data, and I cannot understand it. Maybe I am missing something, or it could be a bug. Unfortunately, I cannot provide a minimal reproducible example, because the data is large and the problem only shows up with big data. But I will try to describe it as best I can.
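To make the mechanics of that sub-expression concrete, here is a minimal sketch on toy data. The tables `people` and `periods` and their contents are hypothetical stand-ins for DT1 and DT2, not the real data, and the join conditions are only an assumed example of what `.(Criteria)` might contain.

```r
library(data.table)

# Hypothetical toy tables standing in for DT1 (people) and DT2 (periods).
people <- data.table(
  NAME = c("JOHN", "MARY", "ANNA"),
  DATE = as.Date(c("2016-01-01", "2016-06-01", "2016-03-01"))
)
periods <- data.table(
  NAME        = c("JOHN", "JOHN", "MARY"),
  START_DATE  = as.Date(c("2015-09-09", "2015-12-01", "2017-01-01")),
  EXPIRY_DATE = as.Date(c("2017-05-01", "2016-02-01", "2017-12-31"))
)

# by = .EACHI evaluates j once per row of `people`, so the result has one row
# per person; N counts the rows of `periods` that satisfy every condition.
counts <- periods[people,
                  on = .(NAME, START_DATE <= DATE, EXPIRY_DATE >= DATE),
                  .N, by = .EACHI]
print(counts)

# A person is matched when at least one row of `periods` qualifies.
people[, MATCHED := counts$N > 0L]
print(people)
```

In this toy case only JOHN has periods covering his date, so he should be the only row flagged as matched. Rows of `people` with no qualifying period still appear in `counts`, with N equal to zero, which is what makes the `> 0L` comparison line up row by row.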
Unexpected behavior or error
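The unexpected behavior shows up when the same update is run again on only the rows that are still unmatched, i.e. with !(MATCHED) in i: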
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]
Each run should give the same result, since nothing has changed between runs (in particular, DT2).
After one run:
MATCHED N
1: FALSE 3248007
2: TRUE 2379514
After running it again:
MATCHED N
1: FALSE 2149648
2: TRUE 3477873
, , , . , , .. , , . ( ).
For example, take this row in DT1:
DATE FORENAME SURNAME
1: 2016-01-01 JOHN SMITH
DT2:
START_DATE EXPIRY_DATE FORENAME SURNAME
1: 2015-09-09 2017-05-01 JOHN SMITH
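When I run the join on the full tables and look at N for this person, no match is reported (N=0), even though the entry above covers the date. (Note that in the output, START_DATE and EXPIRY_DATE both show the value of DATE.)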
SUB <- DF2[DF1, on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
SUB[FORENAME == "JOHN" & SURNAME == "SMITH"]
FORENAME SURNAME START_DATE EXPIRY_DATE N
1: JOHN SMITH 2016-01-01 2016-01-01 0
However, the result changes when the same join is run on that single row of DF1. JOHN SMITH is row 149 of DF1:
DF2[DF1[149], on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
FORENAME SURNAME START_DATE EXPIRY_DATE N
1: JOHN SMITH 2016-01-01 2016-01-01 1
-, , . on=.(FORENAME, SURNAME, START_DATE <= DATE), , .
, . , DT1 DATE DT2 START_DATE END_DATE s, , DT1 CHECKING_DATE DT2 EFFECTIVE_DATE ..
data.table , :
Reproducible example:
set.seed(123)
library(data.table)
library(stringi)
n <- 100000
DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=T))
DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=T))
DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]
DT1[, MATCHED := FALSE]
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 85833
2: TRUE 14167
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]
MATCHED N
1: FALSE 73733
2: TRUE 26267
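One way to narrow this down is to compare the counts from the non-equi join against counts computed with plain logical comparisons, one row at a time. The sketch below is only an assumed diagnostic, not a fix: the tables `A` and `B`, the column `KEY`, and the sample size are hypothetical, and the data is kept small because the reference computation is quadratic.

```r
library(data.table)

set.seed(1)
n <- 2000                            # small on purpose: the check below is O(n^2)
keys <- c("a", "b", "c", "d", "e")   # hypothetical stand-in for the random strings

A <- data.table(KEY  = sample(keys, n, replace = TRUE),
                DATE = as.Date("2016-01-01") + sample(0:365, n, replace = TRUE))
B <- data.table(KEY        = sample(keys, n, replace = TRUE),
                START_DATE = as.Date("2015-01-01") + sample(0:1094, n, replace = TRUE))
B[, EXPIRY_DATE := START_DATE + sample(200:300, n, replace = TRUE)]

# Counts from the non-equi join: one value per row of A, in A's row order.
join_n <- B[A, on = .(KEY, START_DATE <= DATE, EXPIRY_DATE >= DATE),
            .N, by = .EACHI]$N

# Reference counts from ordinary vectorised comparisons, one row of A at a time.
brute_n <- vapply(seq_len(n), function(i) {
  sum(B$KEY == A$KEY[i] &
      B$START_DATE <= A$DATE[i] &
      B$EXPIRY_DATE >= A$DATE[i])
}, integer(1))

# Rows of A where the two methods disagree are the ones worth inspecting.
disagree <- which(join_n != brute_n)
cat("rows where the two counts differ:", length(disagree), "\n")
```

If `disagree` is non-empty, those row indices give concrete cases (like row 149 above) that can be fed back into the join individually. If it is empty at this size, increasing `n` in steps may show roughly where the discrepancy starts to appear.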