data.table with two string columns of set elements: extracting unique rows when each row is unordered

Suppose I have a data.table like this:

Table:

    V1 V2
    A  B
    C  D
    C  A
    B  A
    D  C

I want each row to be treated as a set, so that the rows (B, A) and (A, B) count as the same. After this process I want to get:

    V1 V2
    A  B
    C  D
    C  A

To do this, I first need to sort the elements within each row and then use unique to remove duplicates. The sorting step is pretty slow when I have millions of rows. So, is there an easy way to remove duplicates without sorting?

3 answers

For two columns, you can use the following trick:

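    library(data.table)

    dt = data.table(a = letters[1:5], b = letters[5:1])
    #   a b
    #1: a e
    #2: b d
    #3: c c
    #4: d b
    #5: e a

    # group by the ordered pair and keep the first row number of each group
    dt[dt[, .I[1], by = list(pmin(a, b), pmax(a, b))]$V1]
    #   a b
    #1: a e
    #2: b d
    #3: c c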
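The trick works because pmin(a, b) and pmax(a, b) put each pair into a canonical order, so (a, e) and (e, a) land in the same group, and .I[1] returns the row number of the first member of each group. A minimal sketch that makes the intermediate grouping visible, using the same example data; the names lo, hi, first_row, groups and res are introduced here purely for illustration:

```r
library(data.table)

dt <- data.table(a = letters[1:5], b = letters[5:1])

# Canonical form of each pair: smaller string first, larger string second.
# lo, hi and first_row are illustrative names, not part of the answer above.
groups <- dt[, .(first_row = .I[1]), by = .(lo = pmin(a, b), hi = pmax(a, b))]
groups

# Keep only the first row of each unordered pair.
res <- dt[groups$first_row]
res
```

Note that pmin and pmax compare strings using the current locale's collation, so the canonical order may differ between systems, although the grouping itself should not.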

Borrowing (possibly unrealistic) data from a duplicate question:

    library(data.table)
    size <- 118000000
    key1 <- sample(LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0))
    key2 <- sample(LETTERS, size, replace=TRUE, prob=runif(length(LETTERS), 0.0, 5.0))
    val <- runif(size, 0.0, 5.0)
    dt <- data.table(key1, key2, val, stringsAsFactors=FALSE)

Here is a quick way if your data looks like this:

    # eddi's answer
    system.time(res1 <- dt[dt[, .I[1], by=.(pmin(key1, key2), pmax(key1, key2))]$V1])
    #    user  system elapsed
    #  101.79    3.01  107.98

    # optimized for this data
    system.time({
      dt2 <- unique(dt, by=c("key1", "key2"))[key1 > key2, c("key1", "key2") := .(key2, key1)]
      res2 <- unique(dt2, by=c("key1", "key2"))
    })
    #    user  system elapsed
    #    8.50    1.16    4.93

    fsetequal(copy(res1)[key1 > key2, c("key1", "key2") := .(key2, key1)], res2)
    # [1] TRUE

Data like this seem unlikely if the values are covariances, since each pair should then have at most one duplicate (i.e. AB with BA).


Here is an easy way to remove duplicate rows.

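    delRows = NULL  # the rows to be removed
    for (i in seq_len(nrow(tab))) {
      # later rows that are the reverse of row i
      j = which(tab$V1 == tab$V2[i] & tab$V2 == tab$V1[i])
      j = j[j > i]
      if (length(j) > 0) {
        delRows = c(delRows, j)
      }
    }
    # guard: with no duplicates, tab[-delRows, ] would select zero rows
    if (length(delRows) > 0) tab = tab[-delRows, ]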

The result is as follows. Before:

    > tab
      V1 V2
    1  A  B
    2  C  D
    3  C  A
    4  B  A
    5  D  C

After:

    > tab
      V1 V2
    1  A  B
    2  C  D
    3  C  A
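The loop compares every row against the whole table, so its cost grows quadratically with the number of rows. A vectorized sketch of the same idea in base R, assuming tab is a data.frame with character columns V1 and V2 as in the example; the names key and res are illustrative. Unlike the loop, it also drops exact repeats of a row:

```r
tab <- data.frame(V1 = c("A", "C", "C", "B", "D"),
                  V2 = c("B", "D", "A", "A", "C"),
                  stringsAsFactors = FALSE)

# Order each pair so that (B, A) and (A, B) get the same key.
key <- data.frame(lo = pmin(tab$V1, tab$V2), hi = pmax(tab$V1, tab$V2))

# duplicated() flags every row whose key already appeared in an earlier row.
res <- tab[!duplicated(key), ]
res
```

This relies on the same pmin/pmax canonical ordering as the data.table answer, only expressed with duplicated() instead of grouping.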

Source: https://habr.com/ru/post/973792/

