R: Filter two data frames for duplicate values ​​in two columns and by groups

I have a data.frame dat that stores my regular data, and the groups are identified by ID .

 data <- structure(list(NAME = structure(c(1L, 1L, 2L), .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 23L, 57L), REF_YEAR = c(1920L, 1938L, 1869L), SURV_YEAR = c(1938L, 1962L, 1872L), VALUE = c(20L, 40L, 34L)), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR","VALUE"), class = "data.frame", row.names = c(NA, -3L)) NAME ID REF_YEAR SURV_YEAR VALUE 1 NAME1 23 1920 1938 20 2 NAME1 23 1938 1962 40 3 NAME2 57 1869 1872 34 

And I have a second data.frame , dat_q , which I would like to compare with dat

 dat_q <- structure(list(NAME = structure(1:2, .Label = c("NAME1", "NAME2"), class = "factor"), ID = c(23L, 57L), REF_YEAR = c(1934L, 1866L), SURV_YEAR = c(1938L, 1868L), VALUE = structure(1:2, .Label = c("A", "B"), class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -2L)) NAME ID REF_YEAR SURV_YEAR VALUE 1 NAME1 23 1934 1938 A 2 NAME2 57 1866 1868 B 

My question is: how could I delete all rows in dat_q that have the same value in REF_YEAR or SURV_YEAR columns than in the same dat columns (in 1938 data samples)? This should be applied by the group (as defined by the ID ), and not over the whole data.frame

In the end, using my sample data, this will be the result obtained from the dat_q filtering dat_q

  NAME ID REF_YEAR SURV_YEAR VALUE 2 NAME2 57 1866 1868 B 

EDIT

Here are some other examples of data that the code provided by @thelatemail will not work with. And I cannot understand why dat_q should be filtered out because it contains the same value as dat .

 data <- structure(list(NAME = structure(c(1L, 1L, 1L), .Label = "NAME1", class = "factor"), ID = c(226L, 226L, 226L), SURV_YEAR = c(2009L, 2010L, 2012L), REF_YEAR = c(2008L, 2009L, 2011L), VALUE = c(-7L, -37L, -51L)), .Names = c("NAME", "ID", "SURV_YEAR", "REF_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -3L)) NAME ID SURV_YEAR REF_YEAR VALUE 1 NAME1 226 2009 2008 -7 2 NAME1 226 2010 2009 -37 3 NAME1 226 2012 2011 -51 dat_q <- structure(list(NAME = structure(1L, .Label = "NAME1", class = "factor"), ID = 226L, REF_YEAR = 2010L, SURV_YEAR = 2011L, VALUE = structure(1L, .Label = "-X", class = "factor")), .Names = c("NAME", "ID", "REF_YEAR", "SURV_YEAR", "VALUE"), class = "data.frame", row.names = c(NA, -1L)) NAME ID REF_YEAR SURV_YEAR VALUE 1 NAME1 226 2010 2011 -X 

+6
source share
1 answer

I like by in the R database to figure out the logic of these kinds of problems. This works, but can be a bit slow:

 do.call(rbind,by( dat_q, dat_q$ID, function(x) { subdata <- data[data$ID==x$ID,] x[!(x$REF_YEAR %in% subdata$REF_YEAR | x$SURV_YEAR %in% subdata$SURV_YEAR),] } )) # NAME ID REF_YEAR SURV_YEAR VALUE #57 NAME2 57 1866 1868 B 

A data.table solution by the same logic can be faster:

 library(data.table) setDT(dat_q) setDT(data) dat_q[ , .SD[!(REF_YEAR %in% data$REF_YEAR[data[,ID==.BY]] | SURV_YEAR %in% data$SURV_YEAR[data[,ID==.BY]])], by=ID ] # ID NAME REF_YEAR SURV_YEAR VALUE #1: 57 NAME2 1866 1868 B 

With data.table , I think you can also do it this way. After converting to data.tables,

 # using 1.9.3+, just remove `by=.EACHI` if you're using <= 1.9.2 setkey(data, ID) setkey(dat_q, ID) idx = data[dat_q, any(c(i.REF_YEAR, i.SURV_YEAR) %in% c(REF_YEAR, SURV_YEAR)), by=.EACHI]$V1 dat_q[!idx] # NAME ID REF_YEAR SURV_YEAR VALUE # 1: NAME2 57 1866 1868 B 

We perform the union, and in each row of the data correspondence corresponding to dat_q , in the key columns we calculate the expression in j . This gives us the boolean that we need to index / subset dat_q later.

+5
source

Source: https://habr.com/ru/post/976079/


All Articles