Intersection of multiple columns between two data frames

I have two data frames with two columns in each. For instance:

df.1 = data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"), col.2 = c("b","c","d","e","c","d","e","d","e","e"))
df.2 = data.frame(col.1 = c("b","b","b","a","a","e"), col.2 = c("a","c","e","c","e","c"))

and I am looking for an efficient way to find the row index in df.2 of each pair col.1, col.2 from df.1. Please note that a pair from df.1 may appear in df.2 in the reverse order (for example, df.1[1,], which is "a","b", appears in df.2[1,] as "b","a"). That doesn't matter to me: as long as a pair of values from df.1 appears in either order in df.2, I want its row index in df.2; otherwise the result should be NA. One more note: row pairs in both data frames are unique — each pair appears only once.

So, for these two data frames, the returned vector will be:

c(1,4,NA,5,2,NA,3,NA,6,NA)
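For concreteness, here is a naive brute-force version of what I want (the find_pair helper is just illustrative):

```r
df.1 <- data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"),
                   col.2 = c("b","c","d","e","c","d","e","d","e","e"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(col.1 = c("b","b","b","a","a","e"),
                   col.2 = c("a","c","e","c","e","c"),
                   stringsAsFactors = FALSE)

# look up one (a, b) pair in df.2, ignoring the order of the two values
find_pair <- function(a, b) {
  hit <- which((df.2$col.1 == a & df.2$col.2 == b) |
               (df.2$col.1 == b & df.2$col.2 == a))
  if (length(hit)) hit else NA_integer_
}

mapply(find_pair, df.1$col.1, df.1$col.2, USE.NAMES = FALSE)
# [1]  1  4 NA  5  2 NA  3 NA  6 NA
```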
4 answers

Maybe something using the dplyr package.

First create a reference frame from df.2, in two halves:

  • use row_number() to efficiently compute the row index.
  • use select to flip the columns.

library(dplyr)

df_ref_top <- df.2 %>% mutate(n = row_number())
df_ref_btm <- df.2 %>% select(col.1 = col.2, col.2 = col.1) %>% mutate(n = row_number())

then bind them together:

df_ref <- rbind(df_ref_top,df_ref_btm)

Finally, a left join gives you the answer; extract the n column as a vector:

left_join(df.1,df_ref)$n
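Putting it together as a self-contained sketch (the explicit by= just silences the join message; bind_rows plays the role of rbind here):

```r
library(dplyr)

df.1 <- data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"),
                   col.2 = c("b","c","d","e","c","d","e","d","e","e"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(col.1 = c("b","b","b","a","a","e"),
                   col.2 = c("a","c","e","c","e","c"),
                   stringsAsFactors = FALSE)

# df.2 with its row index, in both column orders
df_ref <- bind_rows(
  df.2 %>% mutate(n = row_number()),
  df.2 %>% select(col.1 = col.2, col.2 = col.1) %>% mutate(n = row_number())
)

# left_join keeps df.1's row order
left_join(df.1, df_ref, by = c("col.1", "col.2"))$n
# [1]  1  4 NA  5  2 NA  3 NA  6 NA
```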
# Per @thelatemail's comment, here's a more elegant approach:
match(apply(df.1,1,function(x) paste(sort(x),collapse="")),
      apply(df.2,1,function(x) paste(sort(x),collapse="")))

# My original answer, for reference:
# Check for matches with both orderings of df.2 columns
match.tmp = cbind(match(paste(df.1[,1],df.1[,2]), paste(df.2[,1],df.2[,2])),
                  match(paste(df.1[,1],df.1[,2]), paste(df.2[,2],df.2[,1])))

# Convert to single vector of match indices
match.index = apply(match.tmp, 1, 
                    function(x) ifelse(all(is.na(x)), NA, max(x, na.rm=TRUE)))

 [1]  1  4 NA  5  2 NA  3 NA  6 NA
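One caveat for the sort/paste one-liner above: with collapse="", multi-character values can collide (c("ab","c") and c("a","bc") both become "abc"). A separator that cannot occur in the data, e.g. "\r" (an arbitrary choice here), avoids this:

```r
df.1 <- data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"),
                   col.2 = c("b","c","d","e","c","d","e","d","e","e"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(col.1 = c("b","b","b","a","a","e"),
                   col.2 = c("a","c","e","c","e","c"),
                   stringsAsFactors = FALSE)

# order-insensitive row key with a collision-proof separator
row_key <- function(df) apply(df, 1, function(x) paste(sort(x), collapse = "\r"))
match(row_key(df.1), row_key(df.2))
# [1]  1  4 NA  5  2 NA  3 NA  6 NA
```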

1) sort/merge. Sort the values within each row of df.2, giving df.2.s, append a row-number column, and merge df.1 against it (no packages needed; this relies on each row of df.1 already being in sorted order, as it is in the example):

df.2.s <- replace(df.2, TRUE, t(apply(df.2, 1, sort)))
df.2.s$row <- 1:nrow(df.2.s)
merge(df.1, df.2.s, all.x = TRUE)$row
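Note that merge both relies on df.1's rows being internally sorted and re-sorts its result by the join columns; in the example both happen to be harmless because df.1 is already fully sorted. A sketch that handles arbitrary input by sorting df.1's rows too and restoring the original row order:

```r
df.1 <- data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"),
                   col.2 = c("b","c","d","e","c","d","e","d","e","e"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(col.1 = c("b","b","b","a","a","e"),
                   col.2 = c("a","c","e","c","e","c"),
                   stringsAsFactors = FALSE)

df.1.s <- replace(df.1, TRUE, t(apply(df.1, 1, sort)))   # sort within each row
df.2.s <- replace(df.2, TRUE, t(apply(df.2, 1, sort)))
df.1.s$idx <- seq_len(nrow(df.1.s))                      # remember df.1's order
df.2.s$row <- seq_len(nrow(df.2.s))

m <- merge(df.1.s, df.2.s, all.x = TRUE)
m$row[order(m$idx)]                                      # back in df.1's order
# [1]  1  4 NA  5  2 NA  3 NA  6 NA
```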

giving:

[1]  1  4 NA  5  2 NA  3 NA  6 NA

2) sqldf. This is an SQL solution using the sqldf package, working on copies df1 and df2 of df.1 and df.2; sqldf exposes the columns as col_1 and col_2, since dots have another meaning in SQL. Normalize df2 so that col_1 holds the smaller and col_2 the larger value of each pair, then left join df1 to it and return the row id (again relying on df.1's rows already being sorted):

df1 <- df.1
df2 <- df.2

library(sqldf)
# note: rowid must be selected inside the subquery for b.rowid to be visible
sqldf("select b.rowid row
    from df1
    left join 
    (select rowid, min(col_1, col_2) col_1, max(col_1, col_2) col_2 from df2) b
    using (col_1, col_2)")$row
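The same min/max normalization can be sketched in base R with pmin/pmax instead of SQL (assuming character, not factor, columns; paste with a space separator is safe here because the values are single letters):

```r
df.1 <- data.frame(col.1 = c("a","a","a","a","b","b","b","c","c","d"),
                   col.2 = c("b","c","d","e","c","d","e","d","e","e"),
                   stringsAsFactors = FALSE)
df.2 <- data.frame(col.1 = c("b","b","b","a","a","e"),
                   col.2 = c("a","c","e","c","e","c"),
                   stringsAsFactors = FALSE)

# order-insensitive key: smaller value first, larger second
key <- function(df) paste(pmin(df$col.1, df$col.2), pmax(df$col.1, df$col.2))
match(key(df.1), key(df.2))
# [1]  1  4 NA  5  2 NA  3 NA  6 NA
```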



Here is a way to do it in base R with a simple loop (no packages needed, though it is not the fastest).

check.rows <- function(data1, data2)
{
  df1 <- as.matrix(data1)
  df2 <- as.matrix(data2)
  ll <- vector('list', nrow(df1))
  # for each row of df1, flag which rows of df2 contain the same values
  for (i in seq(nrow(df1))) {
    ll[[i]] <- sapply(seq(nrow(df2)), function(j) df2[j, ] %in% df1[i, ])
  }
  # the column index (row of df2) where every value matched
  h <- sapply(ll, function(x) which(apply(x, 2, all)))
  # empty matches (integer(0)) become NA here
  sapply(h, function(x) ifelse(is.double(x), NA, x))
}

check.rows(df.1, df.2)
## [1]  1  4 NA  5  2 NA  3 NA  6 NA

And here is a benchmark with the number of rows increased for both df.1 and df.2. Not bad, I think, given 24 checks against each of the 40 rows.

> dim(df.11); dim(df.22)
[1] 40  2
[1] 24  2
> library(microbenchmark)
> f <- function() check.rows(df.11, df.22)
> microbenchmark(f())
## Unit: milliseconds
##  expr      min       lq   median       uq      max neval
##   f() 75.52258 75.94061 76.96523 78.61594 81.00019   100

Source: https://habr.com/ru/post/1541713/

