Data cleansing: a function for finding very similar variables

Question

I have a large data set that partly consists of very similar variables. Some variables have missing values (for example, x3 and x5 in the example below), and some are similar but carry different labels (for example, x2 and x5). To clean my data, I want to identify and ultimately delete these similar variables. I am trying to write a function that returns the column names of all pairs of similar variables. Here is some sample data:

# Example data

set.seed(222)

N <- 100
x1 <- round(rnorm(N, 0, 10))
x2 <- round(rnorm(N, 10, 20))
x3 <- x1
x3[sample(1:N, 7)] <- NA
x4 <- x1
x4[sample(1:N, 5)] <- round(rnorm(5, 0, 10))
x5 <- x2
x5 <- paste("A", x5, sep = "")
x5[sample(1:N, 15)] <- NA

df <- data.frame(x1, x2, x3, x4, x5)

df$x1 <- as.character(df$x1)
df$x2 <- as.character(df$x2)
df$x3 <- as.character(df$x3)
df$x4 <- as.character(df$x4)
df$x5 <- as.character(df$x5)

head(df)

As you can see, x1, x3 and x4 are very similar, and so are x2 and x5. My function should return a list of all pairs of variables that have the same value in 80% or more of the cases. Here is what I have so far:

# My attempt to write such a function

fun_clean <- function(data, similarity) {

  output <- list()
  data <- data[complete.cases(data), ]

  for(i in 1:ncol(data)) {

    if(i < ncol(data)) {

      for(j in (i + 1):ncol(data)) {

        similarity_ij <- sum(data[ , i] == data[ , j]) / nrow(data)

        if(similarity_ij >= similarity) {

          output[[length(output) + 1]] <- colnames(data)[c(i, j)]

        }
      }
    }
  }

  output

}

fun_clean(data = df, similarity = 0.8)

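The function identifies x1, x3 and x4 as similar, but it misses x2 and x5 (i.e. the variables with different labels).

Question: how can I identify all pairs of similar variables, including those with different labels?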

function r bigdata similarity data-cleaning

JSP, 27 Nov '17 at 15:23

The caret package has pre-processing functions that may help here; see:

http://topepo.imtqy.com/caret/pre-processing.html

Daniel Gimenez, 27 Nov '17 at 15:33

Ken S. · Accepted Answer · 2017-11-27T15:42:40+0000

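First, make the columns comparable. Because some values carry a label prefix, strip the non-numeric characters with gsub() and convert the result to numeric: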

df <- apply(df, 2, function(x) as.numeric( gsub("[^0-9]", "", x) ))
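
As a small illustration of what this cleaning step does to individual values, the sketch below applies two patterns to a made-up vector (`vals` is hypothetical, not taken from the data above). A digits-only pattern such as "[^0-9]" also removes a minus sign, so a variant that keeps the sign is shown for comparison. Whether the sign matters depends on the data.

```r
# Hypothetical sample values, chosen only for illustration
vals <- c("A12", "-7", "A-3", NA)

# Digits only: every character that is not 0-9 is removed, including "-"
digits_only <- as.numeric(gsub("[^0-9]", "", vals))

# Variant: keep digits and the minus sign
signed <- as.numeric(gsub("[^0-9-]", "", vals))

digits_only  # 12  7  3 NA
signed       # 12 -7 -3 NA
```

With the digits-only pattern, "-7" and "7" become indistinguishable, which could inflate the similarity between columns that differ only in sign.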

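Next, use combn() to list every pair of columns (combn(5, 2) for the five columns here), then compute the share of equal values for each pair. The pairs whose share passes the threshold are the similar ones.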

combs <- combn(ncol(df), 2)

res <- apply(combs, 2, function(x){
  sum(df[, x[1]] == df[, x[2]], na.rm = TRUE)/nrow(df)
})

thresh <- 0.8
# keep pairs at or above the threshold; drop = FALSE keeps the matrix shape
combs[, res >= thresh, drop = FALSE]
#      [,1] [,2] [,3] [,4]
# [1,]    1    1    2    3
# [2,]    3    4    5    4

So columns 1 & 3, 1 & 4, 2 & 5 and 3 & 4 meet the 80% similarity threshold.
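
Since the question asks for column names rather than indices, the steps above could be wrapped in a function along these lines. This is only a sketch: `similar_pairs()` and the `toy` data frame are hypothetical names introduced here, and the pattern keeps a minus sign, which is an assumption about how negative values should be treated.

```r
# Sketch: return the names of column pairs whose share of equal values
# meets a threshold. similar_pairs() is a hypothetical helper name.
similar_pairs <- function(data, thresh = 0.8) {
  # Strip non-numeric characters (keeping a minus sign), compare as numbers
  m <- sapply(data, function(x) as.numeric(gsub("[^0-9-]", "", x)))
  combs <- combn(ncol(m), 2)
  share <- apply(combs, 2, function(idx) {
    # Rows with NA in either column count as mismatches
    sum(m[, idx[1]] == m[, idx[2]], na.rm = TRUE) / nrow(m)
  })
  keep <- combs[, share >= thresh, drop = FALSE]
  # Map column indices back to names, one pair per list element
  lapply(seq_len(ncol(keep)), function(k) colnames(m)[keep[, k]])
}

# Small made-up data frame: b repeats a with a label prefix and one NA
toy <- data.frame(
  a = as.character(1:10),
  b = c(paste0("A", 1:9), NA),
  c = as.character(20:11),
  stringsAsFactors = FALSE
)

similar_pairs(toy, thresh = 0.8)  # one pair: "a" "b"
```

Returning a list of name pairs matches the output shape of fun_clean() in the question, so the result can be used the same way.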

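Keep in mind how NA values are treated: with na.rm = TRUE and division by nrow(df), rows where either value is NA count as mismatches!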
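
One alternative treatment of missing values, shown here only as a sketch, is to compute the share over rows where both values are observed. `pairwise_share()` is a hypothetical helper and the vectors `x` and `y` are made up; which denominator is appropriate depends on how missing values should be interpreted in the data.

```r
# Share of equal values among rows where both vectors are observed.
# pairwise_share() is a hypothetical helper, not part of the answer above.
pairwise_share <- function(x, y) {
  both <- !is.na(x) & !is.na(y)
  if (!any(both)) return(NA_real_)  # nothing to compare
  mean(x[both] == y[both])
}

x <- c(1, 2, 3, NA, 5)
y <- c(1, 2, 4, 4, NA)

sum(x == y, na.rm = TRUE) / length(x)  # NA rows count as mismatches: 0.4
pairwise_share(x, y)                   # only rows 1 to 3 are compared: 2/3
```

The second approach makes a sparsely observed column look more similar than the first, so it may be worth also requiring a minimum number of jointly observed rows.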
