I have a large dataset that partly consists of very similar variables. Some variables have missing values (for example, x3 and x5 in the example below), and some are similar but carry different labels (for example, x2 and x5). To clean my data, I want to identify and ultimately delete these near-duplicate variables. I am trying to write a function that returns the column names of all pairs of similar variables. Here is some sample data:
set.seed(222)
N <- 100
x1 <- round(rnorm(N, 0, 10))
x2 <- round(rnorm(N, 10, 20))
x3 <- x1
x3[sample(1:N, 7)] <- NA
x4 <- x1
x4[sample(1:N, 5)] <- round(rnorm(5, 0, 10))
x5 <- x2
x5 <- paste("A", x5, sep = "")
x5[sample(1:N, 15)] <- NA
df <- data.frame(x1, x2, x3, x4, x5)
df$x1 <- as.character(df$x1)
df$x2 <- as.character(df$x2)
df$x3 <- as.character(df$x3)
df$x4 <- as.character(df$x4)
df$x5 <- as.character(df$x5)
head(df)
As you can see, x1, x3 and x4 are very similar, and so are x2 and x5. My function should return a list of all pairs of columns that have the same values in 80% or more of the cases. Here is what I have so far:
fun_clean <- function(data, similarity) {
  output <- list()
  data <- data[complete.cases(data), ]
  for(i in 1:ncol(data)) {
    if(i < ncol(data)) {
      for(j in (i + 1):ncol(data)) {
        similarity_ij <- sum(data[ , i] == data[ , j]) / nrow(data)
        if(similarity_ij >= similarity) {
          output[[length(output) + 1]] <- colnames(data)[c(i, j)]
        }
      }
    }
  }
  output
}
fun_clean(data = df, similarity = 0.8)
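This finds the pairs among x1, x3 and x4, but not x2 and x5 (their values differ only by the "A" label prefix).
Question: how can I extend the function so that it also identifies pairs such as x2 and x5?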
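One possible direction, sketched here under assumptions rather than as a definitive solution: normalize the labels before comparing, and compute each pair's agreement only over the rows where both columns are observed. The helper names `strip_label` and `find_similar_pairs` are hypothetical, and the sketch assumes that labels differ from the underlying values only by non-numeric characters (as with the "A" prefix in x5).

```r
# Hypothetical helper: drop every character that is not a digit, dot or
# minus sign, so that "A12" and "12" compare as equal. NA stays NA.
# Assumption: labels differ from the raw values only by non-numeric characters.
strip_label <- function(x) {
  gsub("[^0-9.-]", "", x)
}

# Hypothetical replacement for fun_clean(): returns a list of column-name
# pairs whose normalized values agree in at least `similarity` of the rows
# where BOTH columns are non-missing (pairwise deletion, not listwise).
# Assumes `data` has at least two columns.
find_similar_pairs <- function(data, similarity = 0.8) {
  cleaned <- lapply(data, function(col) strip_label(as.character(col)))
  pairs <- combn(names(cleaned), 2, simplify = FALSE)
  keep <- vapply(pairs, function(p) {
    a <- cleaned[[p[1]]]
    b <- cleaned[[p[2]]]
    ok <- !is.na(a) & !is.na(b)          # rows observed in both columns
    sum(ok) > 0 && mean(a[ok] == b[ok]) >= similarity
  }, logical(1))
  pairs[keep]
}

# Small illustrative data set (not the data from the question)
demo <- data.frame(
  a = c("1", "2", "3", "4", "5"),
  b = c("A1", "A2", NA, "A4", "A5"),
  c = c("9", "8", "7", "6", "5"),
  stringsAsFactors = FALSE
)
res <- find_similar_pairs(demo, similarity = 0.8)
print(res)
```

Note that the denominator here is the number of rows observed in both columns, which is a design choice: it differs from `complete.cases(data)` in `fun_clean`, which drops every row that has a missing value in any column. If missing values should count as disagreements instead, divide by the total number of rows rather than by `sum(ok)`.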