I have two tables containing addresses (street, city, zipcode and two fields containing concatenated values). I would like to do a fuzzy mapping on Zipcode, but only for those cases that have the same StrCity value. I started by first selecting only the addresses that match StrCity from the dictionary, and then fuzzy matching, but there are two problems:
1) if it matches Zipcode, it does not take into account the street and city 2) if it matches the address (contains all Zipcode, Street and City), it also returns possible values, where on the same zip code there is another street, which is close enough to the distance.
Maybe I need something like two different coincidences at the same time (one fuzzy and one exact), but I'm not sure how to implement it without killing computer performance.
Here is an example of TableAd data:
StrCity ID Zipcode Street City Address
BiaĆowiejskaWarszawa 5148676 01-459 BiaĆowiejska Warszawa 01-459BiaĆowiejskaWarszawa
BukowiĆskaWarszawa 6423687 02-730 BukowiĆska Warszawa 02-730BukowiĆskaWarszawa
KanaĆowaWarszawa 6425093 03-536 KanaĆowa Warszawa 03-536KanaĆowaWarszawa
And a sample dictionary:
Zipcode Street City Address StrCity
02-882 Agaty Warszawa 02-882AgatyWarszawa AgatyWarszawa
03-663 KanaĆowa Warszawa 03-663KanaĆowaWarszawa KanaĆowaWarszawa
03-536 KoĆowa Warszawa 03-536KoĆowaWarszawa KoĆowaWarszawa
Here is my current code:
TableMatch <- merge(TableAd, TableDict, by="StrCity")
TableMatch <- TableMatch[, -grep("y", colnames(TableMatch))]
names(TableMatch)[names(TableMatch)=="Zipcode.x"] <- "Zipcode"
names(TableMatch)[names(TableMatch)=="Address.x"] <- "Address"
ResultTable <- TableMatch %>%
stringdist_left_join(TableDict, by="Address", distance_col="dist", method="lv", max_dist=5, ignore_case = TRUE) %>%
select(ID, Zipcode.x, Address.x, Address.y, dist) %>%
group_by(Address.x) %>%
# select best fit record
top_n(-1, dist)
The problem that I specifically found with the above example, the script, checks that strCity KanaĆowaWarszawa is present in the dictionary, but the distance of Levenshtein from the combined address bar is the same when changing the zipcode, as when changing the street to KoĆowa, which has the same zip code as verified. Here it returns both changes, but if there is a 2 or 1 digit difference in the zipcode, then it may incorrectly suggest replacing the street, while the zipcode should be changed.
Note. I use packages purrr
, dplyr
and fuzzyjoin
.