To expand on my comment on Shaunak, the approach I took when I did this was:
Make a series of "hashes" of each name in order of priority. For example, for a group of financial companies in the UK, I used the following:
- "Hash" 1: the exact name of the company, for example. "St. John and James Financial Investments Ltd.".
- Hash 2: company name, with all non-alphanumeric characters removed and normalized spaces: "St John James Financial Investments Ltd"
- Hash 3: Extended general abbreviations, for example. “1st” - “First”, “LLC” - “Limited”: “St. John James Finance Investments Limited”. I also made common spelling mistakes here, for example. Independent - Independent. Obviously, your common mistakes will depend on your raw data.
- Hash 4: the words Stop, for example. "The", "Limited," deleted: "St. John James Financial Investment."
I put all this into a table and then built a query to compare each company name in the table with everyone else, based on their correspondence hash-1, hash-2, hash-3, hash code 4. This gave me a confidence factor - the higher the number of agreed hashes, the greater the likelihood that the companies were actually the same. (Note that with the approach I took, if hash 1 matches, hashes 2, 3, and 4 are guaranteed to match, and so on down the line.)
(Check also for empty hashes and ignore - "Limited company" can reduce to an empty hash, but you do not want it to match other empty hashes.)
I used this approach with Levenshtein to filter out my possible duplicates in order of likelihood - how do you decide where, in accordance with this possible approach, your data will depend on Levenshtein; you might want to apply Levenshein to the third hash rather than the actual name, for example, and present a list of possible duplicates for my business people to make a final decision, automatically matching the highest trust matches, where I was sure that the names really represented the same same company.
Looking at my code, I also used a hash that was Soundex converting each word after removing stop words, etc., although my comments indicate that Metaphone would be better (I used SQL Server, so Soundex was built in ... )
source share