My overall goal is to print only the rows whose value in the names column is the same as (or similar to) that of another row, without repeats. That is, if three rows are duplicates of one another, each of them should be printed only once, not once per pairwise comparison.
A minimal dataset and the library I am using:
library(stringdist)
trye <- data.frame(names = c('aa', 'aa', 'aa', 'bb', 'bb', 'cc'),
                   values = 1:6,
                   id = c('row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6'),
                   stringsAsFactors = FALSE)
My expected result consists of the rows that share the same or a similar name (rows 1, 2, 3, 4 and 5):
trye[1:5, ]
Here are two attempts that did not work (some other modifications caused errors):
# Attempt 1: after printing a match, drop only the current row
i <- 1
while (i < length(trye$names)) {
  dupe <- amatch(trye$names[[i]], trye$names[-i], maxDist = 1)
  if (dupe + 1 > 0) {
    print(trye[i, ])
    duperow <- dupe + 1
    print(trye[duperow, ])
    trye <- trye[-c(i), ]
    i <- i + 1
  } else {
    i <- i + 1
    trye <- trye[-c(i), ]
  }
}
# Attempt 2: after printing a match, drop both the current row and its match
i <- 1
while (i < length(trye$names)) {
  dupe <- amatch(trye$names[[i]], trye$names[-i], maxDist = 1)
  if (dupe + 1 > 0) {
    print(trye[i, ])
    duperow <- dupe + 1
    print(trye[duperow, ])
    trye <- trye[-c(i, duperow), ]
    i <- i + 1
  } else {
    i <- i + 1
    trye <- trye[-c(i, duperow), ]
  }
}
Note that the actual dataset is huge, so deleting rows to reduce the number of comparisons seemed like a good idea to me. Also, the maximum distance needed for the actual dataset is greater than 1.
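To make the target concrete, the result I am after on the toy data can be sketched by brute force. This is only a minimal sketch, assuming the full pairwise distance matrix fits in memory, which is unlikely to hold for the real dataset. The names max_dist, d, has_match and result are illustrative, and the default "osa" distance is an assumption on my part.

```r
library(stringdist)

# Same toy data as above, rebuilt so the sketch is self-contained
trye <- data.frame(names = c('aa', 'aa', 'aa', 'bb', 'bb', 'cc'),
                   values = 1:6,
                   id = c('row 1', 'row 2', 'row 3', 'row 4', 'row 5', 'row 6'),
                   stringsAsFactors = FALSE)

max_dist <- 1  # assumed threshold; the real data would need a larger value

# Pairwise string distances between every pair of names (n x n matrix)
d <- stringdistmatrix(trye$names, trye$names, method = "osa")

# A row always matches itself, so exclude the diagonal from the comparison
diag(d) <- Inf

# Keep each row that has at least one *other* row within max_dist;
# logical indexing returns every such row exactly once
has_match <- rowSums(d <= max_dist) > 0
result <- trye[has_match, ]
print(result)
```

Because the selection is a single logical index rather than a loop over pairs, no row is printed twice. The cost is the n x n matrix, which is exactly what I was trying to avoid by deleting rows as I go.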