data.table with two string columns of set elements: extracting unique rows when each row is unsorted

Question

Suppose I have a data.table like this:

Table:

    V1 V2
     A  B
     C  D
     C  A
     B  A
     D  C

I want each row to be considered a set, which means that the rows B A and A B are the same. So after this process I want to get:

    V1 V2
     A  B
     C  D
     C  A

To do this, I first need to sort the elements within each row, and then use unique to remove duplicates. The sorting step is pretty slow if I have millions of rows. So, is there an easy way to remove duplicates without sorting?
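For reference, the sort-then-unique baseline described above might look like the following. This is only a sketch, assuming the table is called `tab` with columns `V1` and `V2`; the question does not show the code actually used. Note that it returns the rows in sorted form (`A C` rather than `C A`).

```r
library(data.table)

# The example table from the question (the name `tab` is an assumption)
tab <- data.table(V1 = c("A", "C", "C", "B", "D"),
                  V2 = c("B", "D", "A", "A", "C"))

# Sort the two elements within each row; apply() runs sort() once per row,
# which is what makes this slow for millions of rows
sorted <- as.data.table(t(apply(as.matrix(tab), 1, sort)))
setnames(sorted, c("V1", "V2"))

# unique() then drops rows that became identical after sorting
res_sorted <- unique(sorted)
nrow(res_sorted)  # 3 rows remain: (A, B), (C, D), (A, C)
```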

+7

set r duplicates unique data.table

user2923419 Aug 05 '14 at 18:35

3 answers

eddi · Answer 1 · 2014-08-06T02:35:58+0000

For two columns, you can use the following trick:

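    library(data.table)

    dt = data.table(a = letters[1:5], b = letters[5:1])
    #   a b
    #1: a e
    #2: b d
    #3: c c
    #4: d b
    #5: e a

    # group by the (smaller, larger) pair and keep the first row index of each group
    dt[dt[, .I[1], by = list(pmin(a, b), pmax(a, b))]$V1]
    #   a b
    #1: a e
    #2: b d
    #3: c c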
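The trick works because `pmin` and `pmax` return the elementwise smaller and larger string, so every row is mapped to an order-independent pair in one vectorized pass, with no per-row sort. A minimal sketch on the question's data, assuming the table is named `tab` with columns `V1` and `V2`; the `duplicated()` formulation here is an alternative way to use the same key, not part of the answer above:

```r
library(data.table)

# The example table from the question (the name `tab` is an assumption)
tab <- data.table(V1 = c("A", "C", "C", "B", "D"),
                  V2 = c("B", "D", "A", "A", "C"))

# Canonical, order-independent key for each row: (smaller, larger)
canon <- data.table(lo = pmin(tab$V1, tab$V2),
                    hi = pmax(tab$V1, tab$V2))

# Keep the first row of every canonical pair, in the original row order
res <- tab[!duplicated(canon)]
nrow(res)  # 3 rows remain: (A, B), (C, D), (C, A)
```

Unlike sorting each row, this keeps the surviving rows exactly as they appeared in the input.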

Frank · Answer 2 · 2019-04-16T18:56:54+0000

Borrowing (possibly unrealistic) data from a duplicate question:

    library(data.table)
    size <- 118000000
    key1 <- sample(LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0))
    key2 <- sample(LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0))
    val <- runif(size, 0.0, 5.0)
    dt <- data.table(key1, key2, val, stringsAsFactors=FALSE)

Here is a quick way if your data looks like this:

    # eddi answer
    system.time(res1 <- dt[dt[, .I[1], by=.(pmin(key1, key2), pmax(key1, key2))]$V1])
    #    user  system elapsed
    #  101.79    3.01  107.98

    # optimized for this data
    system.time({
      dt2 <- unique(dt, by=c("key1", "key2"))[key1 > key2, c("key1", "key2") := .(key2, key1)]
      res2 <- unique(dt2, by=c("key1", "key2"))
    })
    #    user  system elapsed
    #    8.50    1.16    4.93

    fsetequal(copy(res1)[key1 > key2, c("key1", "key2") := .(key2, key1)], res2)
    # [1] TRUE

Data like this seems unlikely if the values are covariances, since each pair should then have at most one duplicate (i.e. AB with BA).
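The speedup appears to come from the first `unique()` call: with 26 possible letters per column, at most 26 * 26 ordered pairs survive it, so the swap and the second `unique()` only touch a tiny table. A small-scale sketch of the same three steps, with an arbitrary seed and a far smaller `n` chosen purely for illustration:

```r
library(data.table)

set.seed(1)   # arbitrary seed, only so the sketch is reproducible
n <- 10000L   # far smaller than the 118 million rows used above
dt <- data.table(key1 = sample(LETTERS, n, replace = TRUE),
                 key2 = sample(LETTERS, n, replace = TRUE),
                 val  = runif(n))

# Step 1: drop exact duplicates; at most 26 * 26 ordered pairs survive
dt2 <- unique(dt, by = c("key1", "key2"))

# Step 2: put each surviving pair in canonical order (smaller key first)
dt2[key1 > key2, c("key1", "key2") := .(key2, key1)]

# Step 3: deduplicate again, now on the canonical pairs
res <- unique(dt2, by = c("key1", "key2"))

nrow(dt2) <= 26L * 26L        # bounded by the number of ordered pairs
nrow(res) <= 26L * 27L / 2L   # bounded by the number of unordered pairs
```

When nearly every row is a distinct pair, step 1 removes little and this shortcut would not be expected to help much.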

biobee · Answer 3 · 2015-01-30T13:29:22+0000

Here is an easy way to remove duplicate rows.

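    delRows = NULL  # the rows to be removed
    for (i in seq_len(nrow(tab))) {
      # later rows that are the reverse of row i
      j = which(tab$V1 == tab$V2[i] & tab$V2 == tab$V1[i])
      j = j[j > i]
      if (length(j) > 0) {
        delRows = c(delRows, j)
      }
    }
    # guard: with no matches, tab[-delRows, ] would drop every row
    if (length(delRows) > 0) tab = tab[-delRows, ]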

The result is:

Before

    > tab
      V1 V2
    1  A  B
    2  C  D
    3  C  A
    4  B  A
    5  D  C

After

    > tab
      V1 V2
    1  A  B
    2  C  D
    3  C  A
