Fuzzy merging in R - seeking help to improve my code

Inspired by the experimental fuzzy_join function from the fuzzyjoin package, I wrote a function that combines exact and fuzzy (string-distance) matching. The merge task I have to do is quite large (I end up with several string distance matrices of a little less than one billion cells each), and I got the impression that fuzzy_join is not written very efficiently with regard to memory usage, and that its parallelization is implemented in a strange way (if there are several fuzzy variables, the computation of the string distance matrices is parallelized over those variables, rather than over the distance calculations themselves). In contrast to fuzzy_join, the idea of my function is to match on the exact variables first, as far as possible (to reduce the size of the distance matrices), and only then do fuzzy matching within these exactly matched groups. I believe the function itself needs little further explanation. I post it here because I would like some feedback to improve it, and because I suspect I am not the only one trying to do this kind of thing in R (I admit that Python, SQL, and similar tools would probably be more effective in this context, but one should stick with the language one knows best, and doing the data cleaning and preparation in a single language is good for reproducibility).
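To make the memory argument concrete, here is a back-of-the-envelope calculation showing why blocking on the exact variables shrinks the distance matrices (the table sizes and group count are made up for illustration, not taken from my data):

```r
# Hypothetical table sizes (assumptions for illustration only)
nrow_a <- 40000
nrow_b <- 25000

# Without blocking, one distance matrix covers every pair of rows:
full_cells <- nrow_a * nrow_b       # 1e9 cells, close to a billion

# If the exact variables split both tables into, say, 100 equally
# sized groups, only within-group pairs are compared:
g <- 100
blocked_cells <- g * (nrow_a / g) * (nrow_b / g)   # 1e7 cells

full_cells / blocked_cells          # a g-fold reduction
```

With unequal group sizes the reduction is smaller (the largest group dominates), but the principle is the same.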

```r
merge.fuzzy <- function(a, b, .exact, .fuzzy, .weights, .method, .ncores) {
  require(data.table)
  require(stringdist)

  if (length(.fuzzy) != length(.weights)) {
    stop("'.fuzzy' and '.weights' must have the same length")
  }
  if (!inherits(a, "data.table")) stop("'a' must be of class data.table")
  if (!inherits(b, "data.table")) stop("'b' must be of class data.table")

  # convert all matching variables to lower case
  a[, c(.fuzzy) := lapply(.SD, tolower), .SDcols = .fuzzy]
  b[, c(.fuzzy) := lapply(.SD, tolower), .SDcols = .fuzzy]
  a[, c(.exact) := lapply(.SD, tolower), .SDcols = .exact]
  b[, c(.exact) := lapply(.SD, tolower), .SDcols = .exact]

  # create row ids
  a[, "id.a" := as.numeric(.I), by = c(.exact, .fuzzy)]
  b[, "id.b" := as.numeric(.I), by = c(.exact, .fuzzy)]

  # assign a group id to every combination of the exact variables
  groups <- unique(rbind(a[, .exact, with = FALSE], b[, .exact, with = FALSE]))
  groups[, "exa.id" := .GRP, by = .exact]
  a <- merge(a, groups, by = .exact, all = FALSE)
  b <- merge(b, groups, by = .exact, all = FALSE)

  stringdi <- function(a, b, .weights, .by, .method, .ncores) {
    if (is.null(.weights)) .weights <- rep(1, length(.by))

    # put the longer table second; this speeds up the parallel computation
    swapped <- nrow(a) >= nrow(b)
    sdm <- list()
    for (i in seq_along(.by)) {
      sdm[[i]] <- if (swapped) {
        stringdistmatrix(b[[.by[i]]], a[[.by[i]]], method = .method, ncores = .ncores)
      } else {
        stringdistmatrix(a[[.by[i]]], b[[.by[i]]], method = .method, ncores = .ncores)
      }
    }

    # weighted mean of the per-variable distance matrices, ignoring NAs
    # (sweep multiplies column i by .weights[i]; plain 'sdm * .weights'
    # would recycle the weights down the rows instead)
    rsdm <- nrow(sdm[[1]])
    csdm <- ncol(sdm[[1]])
    sdm <- matrix(unlist(sdm), ncol = length(.by))
    sdm <- rowSums(sweep(sdm, 2, .weights, "*"), na.rm = TRUE) /
      ((0 + !is.na(sdm)) %*% .weights)
    sdm <- matrix(sdm, nrow = rsdm, ncol = csdm)
    if (swapped) sdm <- t(sdm)  # restore a-rows x b-columns orientation

    # use the ids as row/column names
    rownames(sdm) <- a$id.a
    colnames(sdm) <- b$id.b

    # best (smallest-distance) match in b for every row of a
    mid <- max.col(-sdm, ties.method = "first")
    mid <- matrix(c(seq_len(nrow(sdm)), mid), ncol = 2)
    bestdis <- sdm[mid]

    res <- data.table(as.numeric(rownames(sdm)),
                      as.numeric(colnames(sdm)[mid[, 2]]),
                      bestdis)
    setnames(res, c("id.a", "id.b", "dist"))
    res
  }

  setkey(b, exa.id)
  distances <- a[, stringdi(.SD, b[J(.BY[[1]])], .weights = .weights,
                            .by = .fuzzy, .method = .method, .ncores = .ncores),
                 by = exa.id]

  a <- merge(a, distances, by = c("exa.id", "id.a"))
  res <- merge(a, b, by = c("exa.id", "id.b"))
  res
}
```
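As a minimal, self-contained illustration of the block-then-fuzzy idea (using only base R's adist in place of stringdistmatrix, and made-up toy data, so it is a sketch of the approach rather than a call to the function above):

```r
a <- data.frame(city = c("berlin", "berlin", "munich"),
                name = c("schmidt", "mueller", "huber"),
                stringsAsFactors = FALSE)
b <- data.frame(city = c("berlin", "berlin", "munich"),
                name = c("schmitt", "muller", "hubert"),
                stringsAsFactors = FALSE)

# fuzzy-match names within one exactly matched block
match_block <- function(ga, gb) {
  d <- adist(ga$name, gb$name)              # Levenshtein distance matrix
  best <- apply(d, 1, which.min)            # closest b-row for each a-row
  data.frame(name.a = ga$name,
             name.b = gb$name[best],
             dist = d[cbind(seq_len(nrow(d)), best)])
}

# block on the exact variable, then match inside each block
res <- do.call(rbind, lapply(intersect(a$city, b$city), function(ct) {
  match_block(a[a$city == ct, ], b[b$city == ct, ])
}))
```

Each distance matrix here is only as large as its block, which is the whole point of matching the exact variables first.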

The following points are interesting:

  • I'm not quite sure how to encode several exact matching variables in the data.table style I used above (which I believe is the fastest option).
  • Is it possible to have nested parallelization? That is, could a parallel foreach loop over the exact-match groups run on top of the parallelized computation of the string distance matrices?
  • I'm also interested in ideas on how to make everything more efficient, i.e. consume less memory.
  • Perhaps you can suggest a larger "real world" dataset so I can create a working example. Unfortunately, I cannot share even small samples of my data with you.
  • In the future, it would be nice to support something other than the classic inner join. Ideas on this topic are also greatly appreciated.
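On the first point: data.table handles several exact matching variables natively, since `setkey` accepts multiple columns and `.GRP` works over a multi-column `by`. A small sketch with made-up columns (the column names are assumptions for illustration):

```r
library(data.table)

dt <- data.table(country = c("de", "de", "fr"),
                 year    = c(2010, 2011, 2010),
                 name    = c("schmidt", "mueller", "dupont"))

# one group id per combination of the exact variables
dt[, exa.id := .GRP, by = .(country, year)]

# key on several columns at once for fast joins
setkey(dt, country, year)
```

This is the same pattern as the single-variable `exa.id` above, just with more columns in `by` and `setkey`.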
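On the second point: an alternative to true nested parallelization (which tends to oversubscribe the cores) is to parallelize over the exact-match groups and compute each group's distances serially on its worker. A toy sketch with base R's parallel package and adist; the block structure and data are made up:

```r
library(parallel)

# hypothetical blocks produced by the exact-matching step
blocks <- list(list(a = c("schmidt", "mueller"), b = c("schmitt", "muller")),
               list(a = "huber",                 b = "hubert"))

cl <- makeCluster(2)
res <- parLapply(cl, blocks, function(blk) {
  d <- adist(blk$a, blk$b)          # per-block distance matrix
  blk$b[apply(d, 1, which.min)]     # best match in b for each a
})
stopCluster(cl)

unlist(res)                         # matches from all blocks
```

Whether this beats the within-matrix parallelization of stringdistmatrix depends on how evenly the blocks are sized; with one dominant block, parallelizing inside that block is likely better.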
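On memory: if only the best match per row is needed, the full distance matrix for a block never has to exist at once. stringdist::amatch is one package-level option; a hand-rolled base-R sketch of the same idea processes the lookup table in chunks and keeps only the running best match (function name, chunk size, and data are made up for illustration):

```r
best_match_chunked <- function(x, table, chunk = 1000L) {
  best_idx <- rep(NA_integer_, length(x))
  best_d   <- rep(Inf, length(x))
  for (start in seq(1L, length(table), by = chunk)) {
    idx <- start:min(start + chunk - 1L, length(table))
    d <- adist(x, table[idx])            # only length(x) x chunk cells live
    j <- apply(d, 1, which.min)          # best within this chunk
    dd <- d[cbind(seq_along(x), j)]
    better <- dd < best_d                # keep earlier match on ties
    best_idx[better] <- idx[j[better]]
    best_d[better]   <- dd[better]
  }
  best_idx
}
```

With chunk = 1000, peak memory for a hypothetical 40,000 x 25,000 block drops from 1e9 cells to 4e7, at the cost of losing the full matrix (so no second-best matches).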

All your comments are welcome!


Source: https://habr.com/ru/post/984687/
