R - combined fuzzy and exact match

I have two tables containing addresses (street, city, zipcode and two fields containing concatenated values). I would like to do a fuzzy mapping on Zipcode, but only for those cases that have the same StrCity value. I started by first selecting only the addresses that match StrCity from the dictionary, and then fuzzy matching, but there are two problems:

1) if it matches Zipcode, it does not take into account the street and city 2) if it matches the address (contains all Zipcode, Street and City), it also returns possible values, where on the same zip code there is another street, which is close enough to the distance.

Maybe I need something like two different coincidences at the same time (one fuzzy and one exact), but I'm not sure how to implement it without killing computer performance.

Here is an example of TableAd data:

StrCity              ID      Zipcode Street       City     Address
BiaƂowiejskaWarszawa 5148676 01-459  BiaƂowiejska Warszawa 01-459BiaƂowiejskaWarszawa
BukowiƄskaWarszawa   6423687 02-730  BukowiƄska   Warszawa 02-730BukowiƄskaWarszawa
KanaƂowaWarszawa     6425093 03-536  KanaƂowa     Warszawa 03-536KanaƂowaWarszawa

And a sample dictionary:

Zipcode Street   City     Address                StrCity
02-882  Agaty    Warszawa 02-882AgatyWarszawa    AgatyWarszawa
03-663  KanaƂowa Warszawa 03-663KanaƂowaWarszawa KanaƂowaWarszawa
03-536  KoƂowa   Warszawa 03-536KoƂowaWarszawa   KoƂowaWarszawa

Here is my current code:

TableMatch <- merge(TableAd, TableDict, by="StrCity")
TableMatch <- TableMatch[, -grep("y", colnames(TableMatch))]
names(TableMatch)[names(TableMatch)=="Zipcode.x"] <- "Zipcode"
names(TableMatch)[names(TableMatch)=="Address.x"] <- "Address"

ResultTable <- TableMatch %>% 
  stringdist_left_join(TableDict, by="Address", distance_col="dist", method="lv", max_dist=5, ignore_case = TRUE) %>%
  select(ID, Zipcode.x, Address.x, Address.y, dist) %>% 
  group_by(Address.x) %>% 
  # select best fit record
  top_n(-1, dist)

The problem that I specifically found with the above example, the script, checks that strCity KanaƂowaWarszawa is present in the dictionary, but the distance of Levenshtein from the combined address bar is the same when changing the zipcode, as when changing the street to KoƂowa, which has the same zip code as verified. Here it returns both changes, but if there is a 2 or 1 digit difference in the zipcode, then it may incorrectly suggest replacing the street, while the zipcode should be changed.

Note. I use packages purrr, dplyrand fuzzyjoin.

+4
source share

Source: https://habr.com/ru/post/1691466/


All Articles