R: checking if a set of variables forms a unique index

I have a large data frame, and I want to check whether the values of a set of (factor) variables uniquely identify each row of the data.

My current strategy is to aggregate by the variables that I believe are the index variables:

dfAgg = aggregate(dfTemp$var1, by = list(dfTemp$var1, dfTemp$var2, dfTemp$var3), FUN = length)
stopifnot(sum(dfAgg$x > 1) == 0)

But this strategy takes forever. A more efficient method will be appreciated.

Thanks.

+3
3 answers

The data.table package provides very fast duplicated and unique methods for data.tables. They also take a by= argument where you can specify the columns on which the duplicate / unique results should be computed.

Here is an example on a large data.frame:

require(data.table)
set.seed(45L)
## use setDT(dat) if your data is a data.frame, 
## to convert it to a data.table by reference
dat <- data.table(var1=sample(100, 1e7, TRUE), 
                 var2=sample(letters, 1e7, TRUE), 
                 var3=sample(as.numeric(sample(c(-100:100, NA), 1e7,TRUE))))

system.time(any(duplicated(dat)))
#  user  system elapsed
# 1.632   0.007   1.671

The same check takes about 25 seconds with anyDuplicated.data.frame.

# if you want to calculate based on just var1 and var2
system.time(any(duplicated(dat, by=c("var1", "var2"))))
#  user  system elapsed
# 0.492   0.001   0.495

This takes about 7.4 seconds with anyDuplicated.data.frame.
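To turn this into the kind of assertion the question uses, a minimal sketch could look like the following. It assumes data.table's anyDuplicated method with its by= argument; the toy data and the name key_cols are made up for illustration.

```r
library(data.table)

# Hypothetical toy data; column names follow the question
dat <- data.table(var1 = c(1, 1, 2),
                  var2 = c("a", "b", "a"),
                  var3 = c(10, 20, 10))

# Candidate index columns (illustrative name)
key_cols <- c("var1", "var2")

# anyDuplicated() returns 0 when no row repeats on the `by` columns
is_unique_key <- anyDuplicated(dat, by = key_cols) == 0L
stopifnot(is_unique_key)

# var3 alone does not identify rows: the value 10 occurs twice
anyDuplicated(dat, by = "var3") == 0L   # FALSE
```

Compared with any(duplicated(...)), anyDuplicated avoids building the full logical vector, although on the timings above the difference is likely small.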

+4

You can use anyDuplicated:

anyDuplicated( dfTemp[, c("var1", "var2", "var3") ] )

or, with dplyr:

dfTemp %>% select(var1, var2, var3) %>% anyDuplicated()

anyDuplicated returns 0 when there are no duplicates, and the index of the first duplicated row otherwise.
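As a rough sketch of how this could be wrapped for repeated use in base R: is_unique_index is a hypothetical helper name, not a function from any package, and the small dfTemp below is invented test data.

```r
# Hypothetical helper: TRUE if the columns in `cols` identify every row of `df`
is_unique_index <- function(df, cols) {
  # drop = FALSE keeps a data frame even when only one column is selected
  anyDuplicated(df[, cols, drop = FALSE]) == 0L
}

# Invented example data with factor columns, as in the question
dfTemp <- data.frame(
  var1 = factor(c("a", "a", "b", "b")),
  var2 = factor(c("x", "y", "x", "y")),
  var3 = factor(c("p", "p", "p", "p"))
)

is_unique_index(dfTemp, c("var1", "var2"))  # TRUE: all four combinations differ
is_unique_index(dfTemp, c("var1", "var3"))  # FALSE: (a, p) and (b, p) repeat
```

The result can be passed straight to stopifnot(), which mirrors the check in the question.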

+1

What about:

length(unique(paste(dfTemp$var1, dfTemp$var2, dfTemp$var3)))==nrow(dfTemp)

Paste the variables into a single string per row, take the unique values, and compare the length of that vector with the number of rows in your data frame.
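One caveat with pasting: with the default separator (a space), two different rows can collapse to the same string if the values themselves contain spaces. The sketch below uses invented data to show this, and assumes that "|" never occurs in the values so that it can serve as a separator.

```r
# Two different rows that produce the same string with the default sep = " "
df <- data.frame(var1 = c("a b", "a"),
                 var2 = c("c", "b c"),
                 stringsAsFactors = FALSE)

keys_default <- paste(df$var1, df$var2)        # "a b c" "a b c"
length(unique(keys_default)) == nrow(df)       # FALSE, although the rows differ

# Assumption: "|" does not appear anywhere in the data
keys_safe <- paste(df$var1, df$var2, sep = "|")
length(unique(keys_safe)) == nrow(df)          # TRUE
```

If no separator can be ruled out, checking the columns directly with anyDuplicated avoids the problem.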

0

Source: https://habr.com/ru/post/1535240/

