Remove duplicates across multiple columns

This seems like a simple problem, but I can't figure it out. I would like to remove duplicates from a data frame (df) when two columns have the same values, even if those values appear in reverse order. For example, take the following data frame:

    a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
    b <- c('A', 'B', 'B', 'C', 'A', 'A', 'B', 'B')
    df <- data.frame(a, b)
    df
      a b
    1 A A
    2 A B
    3 A B
    4 B C
    5 B A
    6 B A
    7 C B
    8 C B

If I remove duplicates now, I get the following data frame:

    df[duplicated(df), ]
      a b
    3 A B
    6 B A
    8 C B

However, I would also like to delete row 6 from this data frame, since "A", "B" is the same as "B", "A". How can I do this automatically?

Ideally, I'd like to be able to specify which two columns should be compared, since my data frames can have additional columns and can be quite large.

Thanks!

4 answers

One solution is to sort each row of df first:

    for (i in 1:nrow(df)) {
      df[i, ] = sort(df[i, ])
    }
    df
      a b
    1 A A
    2 A B
    3 A B
    4 B C
    5 A B
    6 A B
    7 B C
    8 B C

At this point, it's just a matter of removing the duplicate rows:

    df = df[!duplicated(df), ]
    df
      a b
    1 A A
    2 A B
    4 B C

As noted in the comments, your original code actually keeps only the duplicates. You need !duplicated to remove them.
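
To make that point concrete, here is a small sketch of what duplicated() returns on the row-sorted df from the first step, before it is overwritten with the deduplicated version:

    # On the row-sorted df, duplicated() flags a row only when an identical
    # row appeared earlier, i.e. FALSE FALSE TRUE FALSE TRUE TRUE TRUE TRUE.
    duplicated(df)
    # Indexing with !duplicated(df) keeps the first occurrence of each row,
    # while df[duplicated(df), ] would return only the repeats.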


An extension of Ari's answer that lets you indicate which columns to check when other columns are present:

    a <- c(rep("A", 3), rep("B", 3), rep("C", 2))
    b <- c('A', 'B', 'B', 'C', 'A', 'A', 'B', 'B')
    df <- data.frame(a, b)
    df$c = sample(1:10, 8)
    df$d = sample(LETTERS, 8)
    df
      a b  c d
    1 A A 10 B
    2 A B  8 S
    3 A B  7 J
    4 B C  3 Q
    5 B A  2 I
    6 B A  6 U
    7 C B  4 L
    8 C B  5 V

    cols = c(1, 2)
    newdf = df[, cols]
    for (i in 1:nrow(df)) {
      newdf[i, ] = sort(df[i, cols])
    }
    df[!duplicated(newdf), ]
      a b c d
    1 A A 8 X
    2 A B 7 L
    4 B C 2 P
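
The same idea can be wrapped in a small helper that takes column names instead of positions. This is only a sketch building on the answer above; the function name dedup_unordered is illustrative, and it assumes the key columns can be compared as characters:

    # Build an order-insensitive key from the chosen columns only, then use it
    # to index the full data frame, so all other columns are kept untouched.
    dedup_unordered <- function(dat, cols) {
      key <- t(apply(dat[cols], 1, sort))  # apply() coerces each row to character
      dat[!duplicated(key), ]
    }

    dedup_unordered(df, c("a", "b"))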

The other answers use a for loop to assign values row by row. That is not a problem if you have 100 rows or even a thousand, but you will be waiting a while if you have big data on the order of 1M rows.

Taking the data from another related answer that uses data.table, you can try something like:

    df[!duplicated(data.frame(list(do.call(pmin, df), do.call(pmax, df)))), ]
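
To see what that one-liner is doing, here is a sketch of the two intermediate vectors on the original two-column df (assuming a and b are character columns, not factors):

    # pmin/pmax compare the columns element-wise, giving the "smaller" and
    # "larger" value of each pair:
    do.call(pmin, df)  # "A" "A" "A" "B" "A" "A" "B" "B"
    do.call(pmax, df)  # "A" "B" "B" "C" "B" "B" "C" "C"
    # Together the two vectors form the same order-insensitive key that the
    # row-wise sort builds, but without looping over the rows.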

A comparative test with a large data set (df2):

    df2 <- df[sample(1:nrow(df), 50000, replace = TRUE), ]

    system.time(
      df2[!duplicated(data.frame(list(do.call(pmin, df2), do.call(pmax, df2)))), ]
    )
    #  user  system elapsed
    #  0.07    0.00    0.06

    system.time({
      for (i in 1:nrow(df2)) {
        df2[i, ] = sort(df2[i, ])
      }
      df2[!duplicated(df2), ]
    })
    #  user  system elapsed
    # 42.07    0.02   42.09

Using apply would be a better option than loops.

 newDf <- data.frame(t(apply(df,1,sort))) 

All you have to do is remove duplicates.

 newDf <- newDf[!duplicated(newDf),] 
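
Note that newDf holds the sorted copies of the values, coerced to character by apply(). If you would rather keep the original rows (and any extra columns), a variation of the same idea is to use the apply() result only as a duplicate key, for example:

    # Sort just the two key columns row-wise and use the result as the key;
    # index df itself so the original values and any other columns survive.
    key <- t(apply(df[, c("a", "b")], 1, sort))
    df[!duplicated(key), ]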

Source: https://habr.com/ru/post/973790/

