Search for columns with the same data in a single data.frame

I have a data.frame named A with 5000 columns. How can I find the columns in this data.frame that are equal to each other?

3 answers

As @John mentioned, there are problems with using duplicated. I would add that transposing the data.frame forces all the data into the same type before duplicated can compare it. For example, here is a sample data.frame:

df <- data.frame( a = LETTERS[1:3],
                  b = 1:3,
                  c = as.character(1:3),
                  d = LETTERS[1:3],
                  e = 1:3,
                  f = 1:3)
df
#   a b c d e f
# 1 A 1 1 A 1 1
# 2 B 2 2 B 2 2
# 3 C 3 3 C 3 3

Note that column c is very similar to columns b, e and f, but not identical because the types differ (character versus number). The solution proposed by @Jubbles will ignore these differences.
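To make the coercion concrete, here is a minimal sketch using a cut-down version of the data.frame above. The object names (`small`, `transposed`, and so on) are only illustrative, and `stringsAsFactors = FALSE` is set explicitly so the result does not depend on the R version.

```r
# Three columns: letters, integers, and the same integers stored as text
small <- data.frame(a = LETTERS[1:3],
                    b = 1:3,
                    c = as.character(1:3),
                    stringsAsFactors = FALSE)

# Transposing goes through as.matrix(), which needs a single type,
# so everything becomes character
transposed <- t(small)
typeof(transposed)
# [1] "character"

# After coercion, rows b and c hold the same strings,
# so duplicated() reports c as a duplicate of b
dup_after_transpose <- duplicated(transposed)

# Comparing the original columns directly keeps the type difference
same_without_transpose <- identical(small$b, small$c)
same_without_transpose
# [1] FALSE
```

Whether that coercion is acceptable depends on whether columns such as b and c should count as equal for your purposes.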

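Using identical instead guarantees that two columns match exactly, type included. You can compare the columns of the data.frame pairwise with outer: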

are.cols.identical <- function(col1, col2) identical(df[,col1], df[,col2])
identical.mat      <- outer(colnames(df), colnames(df),
                            FUN = Vectorize(are.cols.identical))
identical.mat
#       [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
# [1,]  TRUE FALSE FALSE  TRUE FALSE FALSE
# [2,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
# [3,] FALSE FALSE  TRUE FALSE FALSE FALSE
# [4,]  TRUE FALSE FALSE  TRUE FALSE FALSE
# [5,] FALSE  TRUE FALSE FALSE  TRUE  TRUE
# [6,] FALSE  TRUE FALSE FALSE  TRUE  TRUE

From there, you can use clustering to identify the groups of identical columns:

# as.dist(), hclust() and cutree() are in the stats package, which is loaded by default
distances <- as.dist(!identical.mat)
tree      <- hclust(distances)
cut       <- cutree(tree, h = 0.5)
cut
# [1] 1 2 3 1 2 2

split(colnames(df), cut)
# $`1`
# [1] "a" "d"
# 
# $`2`
# [1] "b" "e" "f"
# 
# $`3`
# [1] "c"

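Edit 1: To ignore differences between integer and floating-point storage, use all.equal instead of identical: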

are.cols.identical <- function(col1, col2) isTRUE(all.equal(df[,col1], df[,col2]))
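As a quick illustration of what changes when identical is swapped for all.equal, the sketch below compares an integer vector with a double vector holding the same values. The variable names are made up for this example.

```r
int_col <- 1:3            # stored as integer
dbl_col <- c(1, 2, 3)     # stored as double
chr_col <- c("1", "2", "3")

# identical() is strict about storage type
strict <- identical(int_col, dbl_col)
strict
# [1] FALSE

# all.equal() compares numeric values within a tolerance;
# wrap it in isTRUE() because it returns a description, not FALSE, on mismatch
loose <- isTRUE(all.equal(int_col, dbl_col))
loose
# [1] TRUE

# Character versus numeric still counts as different
loose_chr <- isTRUE(all.equal(chr_col, dbl_col))
loose_chr
# [1] FALSE
```

So with all.equal, columns b, e and f would still group together even if one of them were stored as double, while c would remain separate.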

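Edit 2: Instead of clustering, you can group the columns by the position of the first TRUE in each row of the matrix: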

cut <- apply(identical.mat, 1, function(x)match(TRUE, x))
split(colnames(df), cut)

Another option is digest() from the digest package (reusing @flodel's example data.frame):

df <- data.frame( a = LETTERS[1:3],
  b = 1:3,
  c = as.character(1:3),
  d = LETTERS[1:3],
  e = 1:3,
  f = 1:3)

library(digest)  # provides digest()

# Hash each column; identical columns produce identical hashes
dfDig <- sapply(df, digest)

ansL <- lapply(seq_along(dfDig), function(x) names(which(dfDig == dfDig[x])))

unique(ansL)

# [[1]]
# [1] "a" "d"

# [[2]]
# [1] "b" "e" "f"

# [[3]]
# [1] "c"

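Note that this approach treats a numeric 1.0 and an integer 1 as different, since the hash covers the storage type as well as the values.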

As @flodel suggested, the groups can also be obtained directly from dfDig:

split(colnames(df), vapply(dfDig, match, 1L, dfDig))

How about transposing the data frame and using duplicated()?

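B <- as.data.frame(t(A))
# flags every copy after the first occurrence
dup1 <- duplicated(B)
# if you want to identify all duplicated rows, also scan from the end
dup2 <- duplicated(B, fromLast = TRUE)
# a row belongs to a duplicate group if either scan flags it
dup_final <- dup1 | dup2
saved_colnames <- colnames(A)[dup_final]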
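To see how the two duplicated() scans relate, here is a small sketch on a plain character vector (the vector `x` is invented for illustration). The forward scan misses the first copy, the backward scan misses the last one, and a logical OR of the two covers every member of a duplicate group.

```r
x <- c("p", "q", "p", "p")

forward  <- duplicated(x)                   # first "p" is not flagged
backward <- duplicated(x, fromLast = TRUE)  # last "p" is not flagged

forward
# [1] FALSE FALSE  TRUE  TRUE
backward
# [1]  TRUE FALSE  TRUE FALSE

# Either scan flagging an element means it has a twin somewhere
all_copies <- forward | backward
all_copies
# [1]  TRUE FALSE  TRUE  TRUE

# A logical vector can be used directly as an index
x[all_copies]
# [1] "p" "p" "p"
```

Keeping the result logical matters for the last step: indexing with a numeric 0/1 vector would select by position rather than act as a mask.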

Source: https://habr.com/ru/post/1658225/

