Unexpected behavior in data.table non-equi join

This is a follow-up to this question, where the accepted answer shows an example of a matching exercise using data.table, including non-equi conditions.

Background

The basic setup is that we have DT1, which holds a sample of people's details, and DT2, which is a sort of database. The goal is to find out whether each person in DT1 matches at least one entry in DT2.

First, we initialize the column that indicates a match to FALSE, so that its values can be updated to TRUE whenever a match is found.

DT1[, MATCHED := FALSE]

The following general solution is then used to update the column:

DT1[, MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]

In theory, it looks like it should work perfectly. The sub-expression DT2[.SD, on=.(Criteria), .N, by=.EACHI] creates a subtable with one row per row of DT1 and computes the column N, which is the number of matches found in DT2 for that row. Then, wherever N is greater than zero, the value of MATCHED in DT1 is updated to TRUE.
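As a minimal sketch of the idiom just described, here is the same update on a toy table. The tables and the NAME column are hypothetical stand-ins for the real criteria, and a plain equi join is used to keep the example small.

```r
library(data.table)

# Hypothetical toy data: NAME stands in for the real matching criteria.
DT1 <- data.table(NAME = c("ANN", "BOB", "CAT"))
DT2 <- data.table(NAME = c("ANN", "ANN", "CAT"))

# Initialise the flag, as in the question.
DT1[, MATCHED := FALSE]

# One row per row of DT1; N is the number of matching rows in DT2
# (0 for rows of DT1 that have no match).
counts <- DT2[DT1, on = .(NAME), .N, by = .EACHI]
print(counts)

# The same sub-expression used inside the update, with .SD standing for DT1.
DT1[, MATCHED := DT2[.SD, on = .(NAME), .N, by = .EACHI]$N > 0L]
print(DT1)
```

Here ANN has two matches, BOB none and CAT one, so only BOB should stay FALSE.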

It works as expected on a trivial reproducible example. But I came across some unexpected behavior when using it with real data, and I cannot understand it. Maybe I missed something, or it could be a bug. Unfortunately, I cannot provide a minimal reproducible example, because the data is large and the problem only shows up with big data. But I will try to document it as well as possible.

Unexpected behavior or error

To avoid re-checking rows that have already been matched, the update is restricted to the rows where !(MATCHED):

DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(Criteria), .N, by=.EACHI ]$N > 0L ]

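However, running this same line more than once changes the result, even though nothing else has changed (in particular, DT2 is the same).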

After the first run:

   MATCHED       N
1:   FALSE 3248007
2:    TRUE 2379514

After the second run:

   MATCHED       N
1:   FALSE 2149648
2:    TRUE 3477873
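One way to narrow this down, sketched here under assumptions, is to split the update into steps and check that the counts line up with the rows being updated. The toy tables and the NAME column are hypothetical, and the check does not by itself explain the discrepancy; it only makes a misalignment fail loudly instead of silently.

```r
library(data.table)

# Hypothetical toy data; NAME stands in for the real matching criteria.
DT1 <- data.table(NAME    = c("ANN", "BOB", "CAT", "DAN"),
                  MATCHED = c(TRUE, FALSE, FALSE, FALSE))
DT2 <- data.table(NAME = c("BOB", "BOB", "DAN"))

# Materialise the rows that the filter !(MATCHED) selects.
todo <- DT1[!(MATCHED)]

# Count matches for each selected row.
counts <- DT2[todo, on = .(NAME), .N, by = .EACHI]

# The update is only meaningful if there is exactly one count per
# selected row, in the same order as the selected rows.
stopifnot(nrow(counts) == nrow(todo),
          identical(counts$NAME, todo$NAME))

# Apply the counts to the same selection of rows.
DT1[!(MATCHED), MATCHED := counts$N > 0L]
print(DT1)
```

On the real data the same two checks could be run with the full set of join conditions in place of NAME.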

Looking into it, some rows that should have matched on the first run were left with N = 0.

For example, in DT1:

         DATE FORENAME SURNAME
1: 2016-01-01     JOHN   SMITH

DT2:

   START_DATE EXPIRY_DATE FORENAME SURNAME
1: 2015-09-09  2017-05-01     JOHN   SMITH

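When the join is run over the full tables and N is inspected, no match is reported for this row (N=0), although one is expected. (Note that in the output, START_DATE and EXPIRY_DATE both show the value of DATE, which is how the non-equi join columns are reported.)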

SUB <- DF2[DF1, on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]
SUB[FORENAME=="JOHN" & SURNAME=="SMITH"]

   FORENAME SURNAME START_DATE EXPIRY_DATE N
1:     JOHN   SMITH 2016-01-01  2016-01-01 0

However, when the same join is run on just this one row of DF1, the match is found. Here, JOHN SMITH is row 149 of DF1:

DF2[DF1[149], on=.(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI]

   FORENAME SURNAME START_DATE EXPIRY_DATE N
1:     JOHN   SMITH 2016-01-01  2016-01-01 1
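The single-row check can be reproduced from the two sample rows shown earlier. This is only a sketch of the well-behaved case: it rebuilds the sample rows as one-row tables (named DT1 and DT2 here) and does not reproduce the full-table discrepancy.

```r
library(data.table)

# The sample rows from the question, rebuilt as one-row tables.
DT1 <- data.table(DATE     = as.Date("2016-01-01"),
                  FORENAME = "JOHN",
                  SURNAME  = "SMITH")
DT2 <- data.table(START_DATE  = as.Date("2015-09-09"),
                  EXPIRY_DATE = as.Date("2017-05-01"),
                  FORENAME    = "JOHN",
                  SURNAME     = "SMITH")

# Match on name, and require START_DATE <= DATE <= EXPIRY_DATE.
res <- DT2[DT1,
           on = .(FORENAME, SURNAME, START_DATE <= DATE, EXPIRY_DATE >= DATE),
           .N, by = .EACHI]
print(res)
```

As in the output above, the START_DATE and EXPIRY_DATE columns of the result carry the DATE value from DT1 rather than the original values from DT2.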

-, , . on=.(FORENAME, SURNAME, START_DATE <= DATE), , .

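Also, the column names here are simplified. In this write-up DT1 has DATE and DT2 has START_DATE and END_DATE, whereas in the real data DT1 has CHECKING_DATE, DT2 has EFFECTIVE_DATE, and so on.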

data.table , :

  • /

Reproducible example:

set.seed(123)
library(data.table)
library(stringi)

n <- 100000

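# DT1: n random 5-letter strings, each with a random date in 2016
DT1 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  DATE = sample(seq(as.Date('2016-01-01'), as.Date('2016-12-31'), by="day"), n, replace=TRUE))

# DT2: n random 5-letter strings, each with a random start date in 2015-2017
DT2 <- data.table(RANDOM_STRING = stri_rand_strings(n, 5, pattern = "[a-k]"),
                  START_DATE = sample(seq(as.Date('2015-01-01'), as.Date('2017-12-31'), by="day"), n, replace=TRUE))

# Each validity window lasts 200 to 299 days; runif() returns 1000 values here,
# which are recycled across the n rows
DT2[, EXPIRY_DATE := START_DATE + floor(runif(1000, 200,300))]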

#Initialization
DT1[, MATCHED := FALSE]

#First run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]

   MATCHED     N
1:   FALSE 85833
2:    TRUE 14167

#Second run
DT1[!(MATCHED), MATCHED := DT2[.SD, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE), .N, by=.EACHI ]$N > 0L ]
DT1[, .N, by=MATCHED]

   MATCHED     N
1:   FALSE 73733
2:    TRUE 26267

#And so on with subsequent runs...

, , , , , .

Add a row ID to both DT1 and DT2, so that rows can be tracked through the join.

DT1[, DT1_ID := 1:nrow(DT1)]
DT2[, DT2_ID := 1:nrow(DT2)]

Then run the join in full:

M <- DT2[DT1, on=.(RANDOM_STRING, START_DATE <= DATE, EXPIRY_DATE >= DATE)]

head(M, 3)

   RANDOM_STRING START_DATE EXPIRY_DATE DT2_ID DT1_ID
1:         diejk 2016-03-30  2016-03-30     NA      1
2:         afjgf 2016-09-14  2016-09-14     NA      2
3:         kehgb 2016-12-11  2016-12-11     NA      3

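M contains every row of DT1, joined to its matching rows in DT2. Where DT2_ID = NA, no match was found. nrow(M) = 100969, which means that some DT1 rows match more than one DT2 row.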

Then use ifelse() to flag the DT1 rows that have at least one match.

DT1$MATCHED <- ifelse(DT1$DT1_ID %in% M[!is.na(DT2_ID)]$DT1_ID, TRUE, FALSE)

Result: 13,316 of the 100,000 rows are matched.

DT1[, .N, by=MATCHED]

   MATCHED     N
1:   FALSE 86684
2:    TRUE 13316
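The ID-based approach above can be followed end to end on a toy example. This is a sketch under assumptions: the tables and the NAME column are hypothetical, `.I` is used for the row IDs, and the flag is set with `%in%` directly, which returns the same logical vector that ifelse() would.

```r
library(data.table)

# Hypothetical toy data: every DT1 row has the date 2016-03-01.
DT1 <- data.table(NAME = c("a", "b", "c"),
                  DATE = as.Date(rep("2016-03-01", 3)))
DT2 <- data.table(NAME        = c("a", "a", "c"),
                  START_DATE  = as.Date(c("2016-01-01", "2016-02-01", "2017-01-01")),
                  EXPIRY_DATE = as.Date(c("2016-12-31", "2016-12-31", "2017-12-31")))

# Row IDs, so that each joined row can be traced back to its source rows.
DT1[, DT1_ID := .I]
DT2[, DT2_ID := .I]

# Full non-equi join: one row per (DT1 row, matching DT2 row) pair,
# plus one row with DT2_ID = NA for each DT1 row without a match.
M <- DT2[DT1, on = .(NAME, START_DATE <= DATE, EXPIRY_DATE >= DATE)]
print(M)

# A DT1 row is matched if its ID appears with a non-missing DT2_ID.
DT1[, MATCHED := DT1_ID %in% M[!is.na(DT2_ID)]$DT1_ID]
print(DT1)
```

Row "a" falls inside two DT2 windows, "b" has no entry in DT2, and the window for "c" starts after the date, so M has four rows and only "a" is flagged.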

Source: https://habr.com/ru/post/1684102/

