Combine partial match data in r

I have two datasets

datf1 <- data.frame (name = c("regular", "kklmin", "notSo", "Jijoh", "Kish", "Lissp", "Kcn", "CCCa"), number1 = c(1, 8, 9, 2, 18, 25, 33, 8)) #----------- name number1 1 regular 1 2 kklmin 8 3 notSo 9 4 Jijoh 2 5 Kish 18 6 Lissp 25 7 Kcn 33 8 CCCa 8 datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp", "KcN", "CaPN"), number2 = c(2, 8, 12, 13, 20, 18, 13)) #------------- name number2 1 reGulr 2 2 ntSo 8 3 Jijoh 12 4 sean 13 5 LiSsp 20 6 KcN 18 7 CaPN 13 

I want to combine them in a column of names, however a partial match is allowed (to avoid the difficulty of merging spelling errors in a large data set and even to detect such spelling errors) and, for example,

(1) If four consecutive letters (all, if the number of letters is less than 4) in any position - this corresponds to a fine

  ABBCD = BBCDK = aBBCD = ramABBBCD = ABB 

(2) ABBCD = aBbCd sensitivity is disabled in the match, e.g. ABBCD = aBbCd

(3) The new dataset will contain both names (names from datf1 and datf2). Thus, this letter we can determine if the match is perfect (can a single column with the number of letters match)

Is such a merger possible?

Changes:

 datf1 <- data.frame (name = c("xxregular", "kklmin", "notSo", "Jijoh", "Kish", "Lissp", "Kcn", "CCCa"), number1 = c(1, 8, 9, 2, 18, 25, 33, 8)) datf2 <- data.frame (name = c("reGulr", "ntSo", "Jijoh", "sean", "LiSsp", "KcN", "CaPN"), number2 = c(2, 8, 12, 13, 20, 18, 13)) uglyMerge(datf1, datf2) name1 name2 number1 number2 matches 1 xxregular <NA> 1 NA 0 2 kklmin <NA> 8 NA 0 3 notSo <NA> 9 NA 0 4 Jijoh Jijoh 2 12 5 5 Kish <NA> 18 NA 0 6 Lissp LiSsp 25 20 5 7 Kcn KcN 33 18 3 8 CCCa <NA> 8 NA 0 9 <NA> reGulr NA 2 0 10 <NA> ntSo NA 8 0 11 <NA> sean NA 13 0 12 <NA> CaPN NA 13 0 
+6
source share
2 answers

Maybe there is a simple solution, but I can not find.
IMHO, you must implement this type of merger for yourself.
The following is an ugly example (there are many opportunities for improvement):

 uglyMerge <- function(df1, df2) { ## lower all strings to allow case-insensitive comparison lowerNames1 <- tolower(df1[, 1]); lowerNames2 <- tolower(df2[, 1]); ## split strings into single characters names1 <- strsplit(lowerNames1, ""); names2 <- strsplit(lowerNames2, ""); ## create the final dataframe mergedDf <- data.frame(name1=as.character(df1[,1]), name2=NA, number1=df1[,2], number2=NA, matches=0, stringsAsFactors=FALSE); ## store names of dataframe2 (to remember which strings have no match) toMerge <- df2[, 1]; for (i in seq(along=names1)) { for (j in seq(along=names2)) { ## set minimal match to 4 or to string length minMatch <- min(4, length(names2[[j]])); ## find single matches matches <- names1[[i]] %in% names2[[j]]; ## look for consecutive matches r <- rle(matches); ## any matches found? if (any(r$values)) { ## find max consecutive match possibleMatch <- r$value == TRUE; maxPos <- which(which.max(r$length[possibleMatch]) & possibleMatch)[1]; ## store max conscutive match length maxMatch <- r$length[maxPos]; ## to remove FALSE-POSITIVES (eg CCC and kcn) find ## largest substring start <- sum(r$length[0:(maxPos-1)]) + 1; stop <- start + r$length[maxPos] - 1; maxSubStr <- substr(lowerNames1[i], start, stop); ## all matching criteria fulfilled isConsecutiveMatch <- maxMatch >= minMatch && grepl(pattern=maxSubStr, x=lowerNames2[j], fixed=TRUE) && nchar(maxSubStr) > 0; if (isConsecutiveMatch) { ## merging mergedDf[i, "matches"] <- maxMatch mergedDf[i, "name2"] <- as.character(df2[j, 1]); mergedDf[i, "number2"] <- df2[j, 2]; ## don't append this row to mergedDf because already merged toMerge[j] <- NA; ## stop inner for loop here to avoid possible second match break; } } } } ## append not matched rows to mergedDf toMerge <- which(df2[, 1] == toMerge); df2 <- data.frame(name1=NA, name2=as.character(df2[toMerge, 1]), number1=NA, number2=df2[toMerge, 2], matches=0, stringsAsFactors=FALSE); mergedDf <- rbind(mergedDf, df2); return (mergedDf); } 

Conclusion:

 > uglyMerge(datf1, datf2) name1 name2 number1 number2 matches 1 xxregular reGulr 1 2 5 2 kklmin <NA> 8 NA 0 3 notSo <NA> 9 NA 0 4 Jijoh Jijoh 2 12 5 5 Kish <NA> 18 NA 0 6 Lissp LiSsp 25 20 5 7 Kcn KcN 33 18 3 8 CCCa <NA> 8 NA 0 9 <NA> ntSo NA 8 0 10 <NA> sean NA 13 0 11 <NA> CaPN NA 13 0 
+7
source

agrep helps you get started.

sort of:

 lapply(tolower(datf1$name), function(x) agrep(x, tolower(datf2$name))) 

then you can configure the max.distance parameter until you get the appropriate number of matches. then combine as you like.

+4
source

Source: https://habr.com/ru/post/915901/


All Articles