What is the best way to compare data when importing to a database?

I have a MySQL table containing information about 1000 stores. I now need to import more stores by downloading a distributed Excel sheet, and I want to avoid creating duplicates.

  • Stores may have the same name, but the addresses will never match.
  • Stores may have the same address, but not the same name.

But here is my problem.

  • The store name may contain errors.
  • The address may contain errors.

I am currently importing the data into a temporary table, and now I'm wondering what the best approach is for comparing the imported stores with the existing ones.

My plan is to go through the import row by row and compare the stores.

  • First compare on a.name = b.name AND a.street = b.street. On a match, the imported store is deleted.
  • Then do a Levenshtein comparison on name and street. Here I will probably have to review the results manually to decide whether a row really is a duplicate (see the sketch after this list).
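
A minimal Python sketch of those two steps (outside MySQL itself), assuming both the temporary table and the main table have already been fetched into lists of dicts keyed by "name" and "street"; the function names and the max_distance threshold are my own placeholders to tune:

    # A rough sketch of the two-step plan: exact matches are dropped outright,
    # while near matches (small Levenshtein distance) are set aside for review.
    def levenshtein(a: str, b: str) -> int:
        """Classic dynamic-programming edit distance."""
        if len(a) < len(b):
            a, b = b, a
        previous = list(range(len(b) + 1))
        for i, ca in enumerate(a, start=1):
            current = [i]
            for j, cb in enumerate(b, start=1):
                current.append(min(previous[j] + 1,                  # deletion
                                   current[j - 1] + 1,               # insertion
                                   previous[j - 1] + (ca != cb)))    # substitution
            previous = current
        return previous[-1]

    def split_imports(imported, existing, max_distance=3):
        exact_duplicates, needs_review, new_rows = [], [], []
        for row in imported:
            if any(row["name"] == e["name"] and row["street"] == e["street"]
                   for e in existing):
                exact_duplicates.append(row)      # safe to delete from the temp table
            elif any(levenshtein(row["name"], e["name"]) <= max_distance
                     and levenshtein(row["street"], e["street"]) <= max_distance
                     for e in existing):
                needs_review.append(row)          # probable duplicate, check manually
            else:
                new_rows.append(row)              # looks genuinely new
        return exact_duplicates, needs_review, new_rows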

Does anyone have experience with this kind of data matching?

Update
Thanks for the good answers.

Fields to be used for comparison:

  • Name
  • Street address
  • Postcode
  • Town
  • Country

I am thinking of something like this:

Select the rows where the country matches and the Levenshtein distance on the name is small.
That way I only need to work with a small list.

Then I can begin to more thoroughly compare the name and address.
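
One possible way to get that small list, sketched in Python under the same assumptions as the earlier snippet (rows as dicts, fields named as in the list above): group the existing stores by country and postcode, then only compare an imported row against the stores in its own group.

    from collections import defaultdict

    # Group existing stores by (country, postcode) so each imported row is only
    # compared against the handful of stores in the same block.
    def candidate_pairs(imported, existing):
        blocks = defaultdict(list)
        for e in existing:
            blocks[(e["country"], e["postcode"])].append(e)
        for row in imported:
            for e in blocks.get((row["country"], row["postcode"]), []):
                yield row, e      # only these pairs need the thorough name/street check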

+6
2 answers

Levenshtein distance is one way to do this, and you may be able to avoid manual review. But the actual implementation will depend on what prior knowledge you have of the data: what kinds of errors you expect in the spelling, and how many.

Suppose, for example, that the data is of good quality and you only expect typos. Then you could build a matching condition based on: 1) the same number of words; 2) the same sequence of those words; 3) a small permissible Levenshtein distance for each word in the name.

The condition can be strengthened by checking the address with a similar rule when the name match is ambiguous, or vice versa.
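
A rough sketch of that condition in Python, reusing the levenshtein() function from the sketch in the question; the per-word threshold is an assumed value you would tune to your data:

    # Names "probably match" if they have the same number of words, in the same
    # order, and every word pair is within a small Levenshtein distance.
    def names_probably_match(a: str, b: str, per_word_threshold: int = 1) -> bool:
        words_a, words_b = a.lower().split(), b.lower().split()
        if len(words_a) != len(words_b):                      # 1) same number of words
            return False
        return all(levenshtein(wa, wb) <= per_word_threshold  # 3) small per-word error
                   for wa, wb in zip(words_a, words_b))       # 2) same order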

+4

To expand on my comment on Shaunak's answer, the approach I took when I did this was:

Make a series of "hashes" of each name, in order of priority. For example, for a set of financial companies in the UK, I used the following (a rough sketch of the normalisation follows the list):

  • "Hash" 1: the exact name of the company, for example. "St. John and James Financial Investments Ltd.".
  • Hash 2: company name, with all non-alphanumeric characters removed and normalized spaces: "St John James Financial Investments Ltd"
  • Hash 3: Extended general abbreviations, for example. “1st” - “First”, “LLC” - “Limited”: “St. John James Finance Investments Limited”. I also made common spelling mistakes here, for example. Independent - Independent. Obviously, your common mistakes will depend on your raw data.
  • Hash 4: the words Stop, for example. "The", "Limited," deleted: "St. John James Financial Investment."
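
A rough Python sketch of how such hashes might be generated; the abbreviation, misspelling and stop-word lists are illustrative placeholders, not the ones actually used here:

    import re

    # Placeholder lists -- the real ones would be tuned to your own data.
    ABBREVIATIONS = {"1st": "First", "llc": "Limited", "ltd": "Limited"}
    MISSPELLINGS = {"independant": "Independent"}
    STOP_WORDS = {"the", "and", "limited"}

    def make_hashes(name: str):
        hash1 = name                                                   # exact name
        hash2 = " ".join(re.sub(r"[^0-9A-Za-z ]", "", name).split())   # strip punctuation, normalise spaces
        words = [ABBREVIATIONS.get(w.lower(), MISSPELLINGS.get(w.lower(), w))
                 for w in hash2.split()]                               # expand abbreviations, fix misspellings
        hash3 = " ".join(words)
        hash4 = " ".join(w for w in words if w.lower() not in STOP_WORDS)  # drop stop words
        return hash1, hash2, hash3, hash4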

I put all of this into a table and then built a query to compare each company name in the table with every other one, based on whether hash 1, hash 2, hash 3 and hash 4 matched. This gave me a confidence factor: the more hashes that matched, the more likely it was that the companies were actually the same. (Note that with the approach I took, if hash 1 matches then hashes 2, 3 and 4 are guaranteed to match as well, and so on down the line.)
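
A sketch of that confidence scoring in Python, building on the hypothetical make_hashes() above:

    from itertools import combinations

    # Count how many of the four hashes agree; the higher the score, the more
    # likely the two names refer to the same company.
    def confidence(name_a: str, name_b: str) -> int:
        return sum(1 for ha, hb in zip(make_hashes(name_a), make_hashes(name_b))
                   if ha and ha == hb)          # empty hashes never count as a match

    def ranked_duplicates(names):
        scored = [(confidence(a, b), a, b) for a, b in combinations(names, 2)]
        return sorted((t for t in scored if t[0] > 0), reverse=True)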

(Also check for empty hashes and ignore them: "Limited Company" can reduce to an empty hash, but you do not want it to match other empty hashes.)

I used this approach together with Levenshtein to rank my possible duplicates in order of likelihood. How Levenshtein fits into the approach will depend on your data; you might want to apply it to the third hash rather than to the actual name, for example. I presented the list of possible duplicates to my business people to make the final decision, and automatically merged the highest-confidence matches, where I was sure the names really did represent the same company.

Looking back at my code, I also used a hash that was the Soundex conversion of each word after removing stop words etc., although my comments indicate that Metaphone would have been better (I was using SQL Server, so Soundex was built in...).
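
For illustration, a simplified Soundex in Python (the full algorithm's special handling of 'H' and 'W' is omitted), applied to each word of a normalised name:

    # Simplified Soundex: first letter plus up to three digit codes.
    def soundex(word: str) -> str:
        codes = {**dict.fromkeys("bfpv", "1"), **dict.fromkeys("cgjkqsxz", "2"),
                 **dict.fromkeys("dt", "3"), "l": "4",
                 **dict.fromkeys("mn", "5"), "r": "6"}
        word = word.lower()
        if not word:
            return ""
        result, previous = word[0].upper(), codes.get(word[0], "")
        for c in word[1:]:
            digit = codes.get(c, "")
            if digit and digit != previous:
                result += digit
            previous = digit
        return (result + "000")[:4]

    def phonetic_hash(name: str) -> str:
        """Soundex of each word, e.g. applied to hash 4 from the sketch above."""
        return " ".join(soundex(w) for w in name.split())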

+3

Source: https://habr.com/ru/post/893754/

