Similar names in a huge list

I have a database of 50,000+ companies that is constantly updated (200+ new entries per month).

Duplicate entries are a huge problem, because the names are not always entered consistently or correctly:
"Shop Super 1"
"Super One Store"
"Super 1 shop"

Edit: another example that probably needs a different approach:
"Amy Pizza" <---> "Organic Pizza from Amy and Company"

I need a tool that scans the data for similar names. I have some experience with Levenshtein distance and LCS, but they work well for checking whether two given strings are similar.
Here I would have to compare each of the 50,000 names against every other one (roughly 1.25 billion pairs) and compute an overall similarity rating.

I need advice on how to attack this problem. The expected result is a list of 10-20 groups of very similar names, with the option to adjust the sensitivity to get more results.

+6
3 answers

I had a similar problem a year or so ago and, if I remember correctly, I ended up using (more or less) similar_text and soundex, as other people suggested in the comments. Something like this:

    <?php
    $str1 = "Store 1 for you";
    $str2 = "Store One 4 You";

    // Compare the Soundex codes of both names; $percent receives the similarity.
    similar_text(soundex($str1), soundex($str2), $percent);

    if ($percent >= 66) {
        echo "Equal";
        // Send an email for review
    } else {
        echo "Different";
        // Proceed to insert in database
    }
    ?>

In my case I used a threshold of 66% to decide that two companies are the same. When that happens, the record is not inserted into the database; instead I get an email so I can review it and check whether the match is correct.
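The same "check before insert" idea can be sketched in Python. This is only an illustration under a few assumptions: `difflib.SequenceMatcher` stands in for PHP's `similar_text` (the two do not produce identical scores), and `NUMBER_WORDS`, `normalize` and `find_probable_duplicates` are hypothetical names made up for this example. Names are normalized first (lowercase, number words mapped to digits, tokens sorted) so that word order and "1" versus "One" do not count as differences.

```python
import re
from difflib import SequenceMatcher

# Hypothetical lookup of spelled-out numbers; extend to fit your data.
NUMBER_WORDS = {"one": "1", "two": "2", "three": "3", "four": "4", "five": "5"}


def normalize(name: str) -> str:
    """Lowercase, keep alphanumeric tokens, map number words, sort tokens."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    tokens = [NUMBER_WORDS.get(token, token) for token in tokens]
    return " ".join(sorted(tokens))


def similarity(a: str, b: str) -> float:
    """Similarity of two names in percent, computed on the normalized forms."""
    return 100.0 * SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def find_probable_duplicates(new_name, existing_names, threshold=66.0):
    """Return (score, name) pairs at or above the threshold, best match first."""
    scored = [(similarity(new_name, name), name) for name in existing_names]
    return sorted((pair for pair in scored if pair[0] >= threshold), reverse=True)


if __name__ == "__main__":
    known = ["Shop Super 1", "Super One Store", "Blue Bakery"]
    for score, name in find_probable_duplicates("Super 1 shop", known):
        print(f"{score:5.1f}%  {name}")  # candidates to send for manual review
```

An empty result would mean the new name can be inserted directly; anything returned goes to manual review, as in the email step described above. The 66% default is simply carried over from the PHP version and would need tuning for a different similarity function.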

After a few months of using this approach, I switched to identifying companies by a unique code (the CIF in my case, which is unique per company here in Spain).

+3

It depends purely on how much difference you are willing to tolerate when treating two strings as similar. soundex may also be useful:

    select soundex('Super One Store');   -- returns S165236
    select soundex('Super 1 Store');     -- returns S16236
    select soundex('Super One Stores');  -- returns S1652362

The codes follow the same S16...236 pattern but are not identical (the word "One" adds a 5), so you can filter on both variants as shown below:

    select *
    from (
        select 'Super One Store' as c
        union select 'Super 1 Store' as c
        union select 'Super One Stores' as c
        union select 'different one' as c
        union select 'supers stores' as c
    ) tmp
    where soundex(c) like CONCAT('%', soundex('Super store'), '%')
       or soundex(c) like CONCAT('%', soundex('Super one store'), '%');
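
For anyone who wants to try the same filter outside the database, here is a rough Python sketch. `phonetic_key` is a simplified, hand-written Soundex-style function of unlimited length (vowels dropped first, then adjacent duplicate codes); it is an assumption of this example, not MySQL's implementation, and may differ from SOUNDEX() on edge cases such as very short or non-ASCII input. `filter_by_probes` is likewise a hypothetical helper that mimics the two LIKE conditions.

```python
# Soundex digit groups; vowels, h, w, y and non-letters get no code.
SOUNDEX_CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for letter in letters:
        SOUNDEX_CODES[letter] = digit


def phonetic_key(text: str) -> str:
    """Simplified Soundex-style key without the usual 4-character limit."""
    letters = [ch for ch in text.lower() if "a" <= ch <= "z"]
    if not letters:
        return ""
    key = letters[0].upper()
    previous = SOUNDEX_CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = SOUNDEX_CODES.get(ch)
        if code is None:
            continue  # uncoded letters are dropped and do not separate duplicates
        if code != previous:
            key += code
        previous = code
    return key


def filter_by_probes(names, probes):
    """Keep names whose key contains the key of at least one probe string."""
    probe_keys = [phonetic_key(probe) for probe in probes]
    return [name for name in names
            if any(key in phonetic_key(name) for key in probe_keys)]


if __name__ == "__main__":
    names = ["Super One Store", "Super 1 Store", "Super One Stores",
             "different one", "supers stores"]
    for name in filter_by_probes(names, ["Super store", "Super one store"]):
        print(phonetic_key(name), name)
```

With 50,000 names, storing such a key in an indexed column would let each new name be compared only against names with a related key instead of against the whole list.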
+1

I think you should go through this list of companies manually and create a table with one unique record per company. Then use a many-to-one mapping table that links each name variant to the correct company. This is essentially normalization.

Table: companies:

    |id|base_name
    |1 |Super 1 Store

Table: company_mapping:

    |id|company_id|name
    |1 |1         |Super 1 Store
    |2 |1         |Super One Store
    |3 |1         |Super 1 Stores
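
A minimal sketch of how these two tables could be used, written in Python with an in-memory SQLite database purely for illustration. The column types, the UNIQUE constraint on `name` and the `canonical_name` helper are assumptions of this example; only the table names, column names and rows come from the layout above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE companies (
        id        INTEGER PRIMARY KEY,
        base_name TEXT NOT NULL
    );
    CREATE TABLE company_mapping (
        id         INTEGER PRIMARY KEY,
        company_id INTEGER NOT NULL REFERENCES companies(id),
        name       TEXT NOT NULL UNIQUE  -- assumed: each variant maps to one company
    );
    INSERT INTO companies (id, base_name) VALUES (1, 'Super 1 Store');
    INSERT INTO company_mapping (id, company_id, name) VALUES
        (1, 1, 'Super 1 Store'),
        (2, 1, 'Super One Store'),
        (3, 1, 'Super 1 Stores');
""")


def canonical_name(connection, name):
    """Return the base_name for a known name variant, or None if unmapped."""
    row = connection.execute(
        "SELECT c.base_name"
        " FROM company_mapping AS m"
        " JOIN companies AS c ON c.id = m.company_id"
        " WHERE m.name = ?",
        (name,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    print(canonical_name(conn, "Super One Store"))  # known variant
    print(canonical_name(conn, "Amy Pizza"))        # unmapped, needs manual review
```

A name that comes back as None has not been mapped yet, so it is either a new company or a new spelling that should be added to company_mapping by hand.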
-1

Source: https://habr.com/ru/post/958838/

