Clustering / matching over many dimensions in R

I have a very large and complex data set with many observations of companies. Some of the observations are redundant, and I need to build a key that maps the redundant observations to a single one. However, the only way to tell whether they really represent the same company is through the similarity of many variables. I believe a suitable approach is a kind of clustering based on a variety of conditions, or perhaps some other kind of matching approach. Perhaps I only need flexible tools for building a complex similarity matrix.

Unfortunately, I'm not quite sure how to do this in R. Most of the clustering and categorization tools I have seen work with either numerical distance or categorical data, but they do not seem to support multiple conditions or user-defined conditions.

Below I have tried to create a smaller, public example of the type of data I am working with and the result I am trying to produce. There are some conditions that must hold, for example, the location must be the same. There are some variables that can relate to each other, for example var1 and var2. Then there are some variables that can relate to each other but must not conflict, for example var3.

An additional level of complexity is that the kind of association I am trying to use to match redundant observations varies from case to case. For example, ids 1 and 2 are the same company, redundantly entered into the data twice. In one place it is called "apples" and in another "red apples". They have the same location and the same values of var1 and var3 (after adjusting for formatting). Similarly, ids 3, 5, and 6 are also really one company, although most of the entries for each are different. Some clusters cover several observations, others only one. Ideally, I would like to find a way to categorize or match cases based on several conditions, for example:

1. Verify that the location is the same
2. Verify that var3 does not differ
3. Check whether one name is a substring of the other
4. Check the edit distance between the names
5. Check the similarity of var1 and var2 between cases
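The conditions listed above can be sketched as a rule-based pairwise match in base R. This is only an illustration under stated assumptions, not a tested solution for the full data: the helper `same`, the edit-distance cutoff of 2, and the rule "at least two pieces of supporting evidence" are all made up for the example, and only ids 1 to 6 of the sample data are used.

```r
# Toy subset of the example data (ids 1-6); column names follow the question.
df <- data.frame(
  id       = 1:6,
  name     = c("apples", "red apples", "green apples",
               "bananas", "oranges", "green apple"),
  location = c("US", "US", "Mexico", "Brazil", "Mexico", "Mexico"),
  var1     = c(1, 1, 2, 2, 2, NA),
  var2     = c("abc", NA, "def", "abc", NA, "def"),
  var3     = c("12345", "12-345", "235-92", NA, "23592", NA),
  stringsAsFactors = FALSE
)
n <- nrow(df)

# Hypothetical helper: TRUE only where both values are present and equal.
same <- function(x, y) !is.na(x) & !is.na(y) & x == y

# Normalise var3 by stripping formatting characters (NA stays NA).
v3 <- gsub("[^0-9]", "", df$var3)

# Hard conditions: same location (1) and no conflicting var3 (2).
same_loc    <- outer(df$location, df$location, "==")
v3_conflict <- outer(v3, v3, function(a, b) !is.na(a) & !is.na(b) & a != b)

# Soft evidence: substring names (3), small edit distance (4),
# agreement on var1, var2 (5) and on the normalised var3.
contains   <- sapply(df$name, function(a) grepl(a, df$name, fixed = TRUE),
                     USE.NAMES = FALSE)
name_sub   <- contains | t(contains)
name_close <- adist(df$name) <= 2          # assumed cutoff
score <- name_sub + name_close +
  outer(df$var1, df$var1, same) +
  outer(df$var2, df$var2, same) +
  outer(v3, v3, same)

# Assumed rule: a pair matches if the hard conditions hold and at least
# two pieces of soft evidence agree.
match_mat <- same_loc & !v3_conflict & score >= 2
diag(match_mat) <- TRUE

# Group transitively linked rows: each row repeatedly takes the smallest
# label among its matches until nothing changes.
group <- seq_len(n)
repeat {
  new_group <- vapply(seq_len(n),
                      function(i) min(group[match_mat[i, ]]), integer(1))
  if (identical(new_group, group)) break
  group <- new_group
}
df$Result <- df$id[group]
print(df[, c("id", "name", "Result")])
```

Because grouping is transitive, ids 5 and 6 end up together through id 3 even though they share no direct evidence. On larger data the cutoffs would need tuning, and comparing only rows within the same location would avoid building full n-by-n matrices.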

In any case, I hope there are more effective, more flexible tools for this than what I have found, or that someone has experience working with similar data in R. Any suggestions and tips are much appreciated!

Data

    id  name          location  var1  var2  var3
     1  apples        US        1     abc   12345
     2  red apples    US        1     NA    12-345
     3  green apples  Mexico    2     def   235-92
     4  bananas       Brazil    2     abc   NA
     5  oranges       Mexico    2     NA    23592
     6  green apple   Mexico    NA    def   NA
     7  tangerines    Honduras  NA    abc   3498
     8  mango         Honduras  1     NA    NA
     9  strawberries  Honduras  NA    abcd  3498
    10  strawberry    Honduras  NA    abc   3498
    11  blueberry     Brazil    1     abcd  2348
    12  blueberry     Brazil    3     abc   NA
    13  blueberry     Mexico    NA    def   1859
    14  bananas       Brazil    1     def   2348
    15  blackberries  Honduras  NA    abc   NA
    16  grapes        Mexico    6     qrs   NA
    17  grapefruits   Brazil    1     NA    1379
    18  grapefruit    Brazil    2     bcd   1379
    19  mango         Brazil    3     efaq  NA
    20  fuji apples   US        4     NA    189-35
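For anyone who wants to experiment, one possible way to load this sample into R is shown here. It is just a convenience sketch: the comma-separated layout and the variable name `csv_text` are my own choices, while `test` is the data frame name used in the answer.

```r
# The sample data as comma-separated text (names may contain spaces).
csv_text <- "id,name,location,var1,var2,var3
1,apples,US,1,abc,12345
2,red apples,US,1,NA,12-345
3,green apples,Mexico,2,def,235-92
4,bananas,Brazil,2,abc,NA
5,oranges,Mexico,2,NA,23592
6,green apple,Mexico,NA,def,NA
7,tangerines,Honduras,NA,abc,3498
8,mango,Honduras,1,NA,NA
9,strawberries,Honduras,NA,abcd,3498
10,strawberry,Honduras,NA,abc,3498
11,blueberry,Brazil,1,abcd,2348
12,blueberry,Brazil,3,abc,NA
13,blueberry,Mexico,NA,def,1859
14,bananas,Brazil,1,def,2348
15,blackberries,Honduras,NA,abc,NA
16,grapes,Mexico,6,qrs,NA
17,grapefruits,Brazil,1,NA,1379
18,grapefruit,Brazil,2,bcd,1379
19,mango,Brazil,3,efaq,NA
20,fuji apples,US,4,NA,189-35"

# "NA" fields become missing values; var3 stays character because
# some entries contain a hyphen.
test <- read.csv(text = csv_text, stringsAsFactors = FALSE)
str(test)
```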

Result

    id  name          location  var1  var2  var3    Result
     1  apples        US        1     abc   12345   1
     2  red apples    US        1     NA    12-345  1
     3  green apples  Mexico    2     def   235-92  3
     4  bananas       Brazil    2     abc   NA      4
     5  oranges       Mexico    2     NA    23592   3
     6  green apple   Mexico    NA    def   NA      3
     7  tangerines    Honduras  NA    abc   3498    7
     8  mango         Honduras  1     NA    NA      8
     9  strawberries  Honduras  NA    abcd  3498    7
    10  strawberry    Honduras  NA    abc   3498    7
    11  blueberry     Brazil    1     abcd  2348    11
    12  blueberry     Brazil    3     abc   NA      11
    13  blueberry     Mexico    NA    def   1859    13
    14  bananas       Brazil    1     def   2348    11
    15  blackberries  Honduras  NA    abc   NA      15
    16  grapes        Mexico    6     qrs   NA      16
    17  grapefruits   Brazil    1     NA    1379    17
    18  grapefruit    Brazil    2     bcd   1379    17
    19  mango         Brazil    3     efaq  NA      19
    20  fuji apples   US        4     NA    189-35  20

Thanks in advance for your time and help!

1 answer
    library(stringdist)

    getMatches <- function(df, tolerance = 6) {
      out <- integer(nrow(df))
      for (row in seq_len(nrow(df))) {
        # Total Levenshtein distance from this row to every row,
        # summed over all columns.
        dists <- numeric(nrow(df))
        for (col in seq_len(ncol(df))) {
          tempDist <- stringdist(df[row, col], df[, col], method = "lv")
          # WARNING: Matches NA perfectly.
          tempDist[is.na(tempDist)] <- 0
          dists <- dists + tempDist
        }
        dists[row] <- Inf  # never match a row to itself
        min_dist <- min(dists)
        if (min_dist < tolerance) {
          out[row] <- which.min(dists)
        } else {
          out[row] <- row
        }
      }
      return(out)
    }

    test$Result <- getMatches(test[, -1])

Here test is your data. This almost certainly needs some refinement, and the output needs some post-processing. It creates a column holding the row index of the closest match. If it cannot find a match within the given tolerance, it returns the row's own index.
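One possible form of that post-processing is to collapse the closest-match links into a single key per group, so that rows pointing at each other (directly or through a chain) share the smallest row index. The function name `collapseMatches` is made up for this sketch, and it assumes the input is an index vector like the one getMatches returns.

```r
# Hypothetical helper: turn "closest match" links into one key per group.
collapseMatches <- function(match_idx) {
  n <- length(match_idx)
  key <- seq_len(n)
  repeat {
    # Each row takes the smaller of its own key and its match's key ...
    new_key <- pmin(key, key[match_idx])
    # ... and passes its key back to the row it points at.
    for (i in seq_len(n)) {
      j <- match_idx[i]
      new_key[j] <- min(new_key[j], new_key[i])
    }
    if (identical(new_key, key)) break
    key <- new_key
  }
  key
}

# Toy input: rows 1 and 2 point at each other, 5 and 6 point at 3,
# 3 points at 6, and row 4 found no match (points at itself).
links <- c(2L, 1L, 6L, 4L, 3L, 3L)
print(collapseMatches(links))
```

With the function above this could be chained as `test$Result <- collapseMatches(getMatches(test[, -1]))`. Whether the resulting groups agree with the desired Result column still depends on the tolerance and on how NA values are treated.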

EDIT: I will try to refine this a bit later.


Source: https://habr.com/ru/post/1204372/

