R - combined fuzzy and exact match

Question

R - combined fuzzy and exact match

I have two tables containing addresses (street, city, zipcode and two fields containing concatenated values). I would like to do a fuzzy mapping on Zipcode, but only for those cases that have the same StrCity value. I started by first selecting only the addresses that match StrCity from the dictionary, and then fuzzy matching, but there are two problems:

1) if it matches Zipcode, it does not take into account the street and city 2) if it matches the address (contains all Zipcode, Street and City), it also returns possible values, where on the same zip code there is another street, which is close enough to the distance.

Maybe I need something like two different coincidences at the same time (one fuzzy and one exact), but I'm not sure how to implement it without killing computer performance.

Here is an example of TableAd data:

StrCity              ID      Zipcode Street       City     Address
BiałowiejskaWarszawa 5148676 01-459  Białowiejska Warszawa 01-459BiałowiejskaWarszawa
BukowińskaWarszawa   6423687 02-730  Bukowińska   Warszawa 02-730BukowińskaWarszawa
KanałowaWarszawa     6425093 03-536  Kanałowa     Warszawa 03-536KanałowaWarszawa

And a sample dictionary:

Zipcode Street   City     Address                StrCity
02-882  Agaty    Warszawa 02-882AgatyWarszawa    AgatyWarszawa
03-663  Kanałowa Warszawa 03-663KanałowaWarszawa KanałowaWarszawa
03-536  Kołowa   Warszawa 03-536KołowaWarszawa   KołowaWarszawa

Here is my current code:

TableMatch <- merge(TableAd, TableDict, by="StrCity")
TableMatch <- TableMatch[, -grep("y", colnames(TableMatch))]
names(TableMatch)[names(TableMatch)=="Zipcode.x"] <- "Zipcode"
names(TableMatch)[names(TableMatch)=="Address.x"] <- "Address"

ResultTable <- TableMatch %>% 
  stringdist_left_join(TableDict, by="Address", distance_col="dist", method="lv", max_dist=5, ignore_case = TRUE) %>%
  select(ID, Zipcode.x, Address.x, Address.y, dist) %>% 
  group_by(Address.x) %>% 
  # select best fit record
  top_n(-1, dist)

The problem that I specifically found with the above example, the script, checks that strCity KanałowaWarszawa is present in the dictionary, but the distance of Levenshtein from the combined address bar is the same when changing the zipcode, as when changing the street to Kołowa, which has the same zip code as verified. Here it returns both changes, but if there is a 2 or 1 digit difference in the zipcode, then it may incorrectly suggest replacing the street, while the zipcode should be changed.

Note. I use packages purrr, dplyrand fuzzyjoin.

+4

matching r

Przem Dec 28 '17 at 14:03

source share

No one has answered this question yet.

See related questions:

8

Exact match and GenMatch in R

4

Order-independent fuzzy match "First Name Last Name" / "Last Name First Name" in R?

2