SQL Server: Correct Incorrect Company Names

I am looking for advice on how to deal with different spellings of the same name. I have a SQL Server database with company names, and some companies appear more than once under different spellings.

For instance:

 Building Supplies pty
 Buidings Supplies pty
 Building Supplied l/d

The problem is that there is no clear pattern in the variations. Sometimes it is a superfluous character, at other times an additional space.

Unfortunately, I don't have a lookup list, so I cannot use Fuzzy Lookup. I need to create a clean list.

Is there a method that people use to solve this problem?

P.S. I searched for this problem but could not find a similar thread.

Thanks.

2 answers

You can use SOUNDEX() and DIFFERENCE() for this purpose.

    DECLARE @SampleData TABLE(ID INT, BLD VARCHAR(50), SUP VARCHAR(50))

    INSERT INTO @SampleData
    SELECT 1, 'Building', 'Supplies'   UNION
    SELECT 2, 'Buidings', 'Supplies'   UNION
    SELECT 3, 'Biulding', 'Supplied'   UNION
    SELECT 4, 'Road',     'Contractor' UNION
    SELECT 5, 'Raod',     'Consractor' UNION
    SELECT 6, 'Highway',  'Supplies'

    SELECT *, DIFFERENCE('Building', BLD) AS DIF
    FROM @SampleData
    WHERE DIFFERENCE('Building', BLD) >= 3

Result

    ID  BLD       SUP       DIF
    1   Building  Supplies  4
    2   Buidings  Supplies  3
    3   Biulding  Supplied  4

If this serves your purpose, you can write an UPDATE statement to correct the selected records accordingly.
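As a minimal sketch of such an update, the following reuses the sample table from the query above and overwrites every close match with the spelling 'Building'. The threshold of 3 is taken from that query. The preview step and the @Correct variable are assumptions added for illustration, not part of the original answer.

```sql
-- Sample data as in the query above (shortened).
DECLARE @SampleData TABLE(ID INT, BLD VARCHAR(50), SUP VARCHAR(50));

INSERT INTO @SampleData (ID, BLD, SUP)
VALUES (1, 'Building', 'Supplies'),
       (2, 'Buidings', 'Supplies'),
       (3, 'Biulding', 'Supplied'),
       (4, 'Road',     'Contractor');

-- Assumed: the spelling you have decided is the correct one.
DECLARE @Correct VARCHAR(50) = 'Building';

-- Preview the rows that would change before touching anything.
SELECT ID, BLD, DIFFERENCE(@Correct, BLD) AS DIF
FROM @SampleData
WHERE DIFFERENCE(@Correct, BLD) >= 3
  AND BLD <> @Correct;

-- Apply the correction to the same set of rows.
UPDATE @SampleData
SET BLD = @Correct
WHERE DIFFERENCE(@Correct, BLD) >= 3
  AND BLD <> @Correct;

SELECT ID, BLD, SUP FROM @SampleData;
```

Keep in mind that SOUNDEX() produces only a four-character code, so on long names it mostly reflects the beginning of the string. Checking the preview before running the UPDATE helps catch false matches.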


Besides the SOUNDEX() / DIFFERENCE() option (which is a very good one!), you can also take a closer look at SSIS.

If your data is in English and is not just people's names, you can do a lot with these components:

Term Extraction

Term Lookup

Fuzzy Grouping

Fuzzy Lookup

The main flow would be a multi-stage structure in which you try to find duplicates with increasingly specific methods. Instead of applying the changes automatically, you send all the names, along with the keys needed to apply the changes, to a staging area where they can be reviewed and, if appropriate, applied.
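The staging idea can also be sketched in plain T-SQL, outside of SSIS. Everything here is hypothetical: the table and column names, the two stages (whitespace differences first, then a phonetic match) and the convention that the row with the lower ID holds the proposed spelling. Treat it as an illustration of the flow, not a definitive implementation.

```sql
-- Hypothetical source table.
DECLARE @Company TABLE (ID INT PRIMARY KEY, CompanyName VARCHAR(100));
INSERT INTO @Company (ID, CompanyName)
VALUES (1, 'Building Supplies'),
       (2, 'Building  Supplies'),   -- extra space
       (3, 'Buidings Supplies'),    -- misspelling
       (4, 'Highway Supplies');

-- Hypothetical staging area: nothing is changed until a reviewer approves it.
DECLARE @Staging TABLE (
    SourceID     INT,
    OriginalName VARCHAR(100),
    ProposedName VARCHAR(100),
    MatchStage   VARCHAR(20),
    Approved     BIT NOT NULL DEFAULT 0
);

-- Stage 1: names that are equal once whitespace is normalized.
INSERT INTO @Staging (SourceID, OriginalName, ProposedName, MatchStage)
SELECT c.ID, c.CompanyName, k.CompanyName, 'whitespace'
FROM @Company AS c
JOIN @Company AS k
  ON k.ID < c.ID
 AND c.CompanyName <> k.CompanyName
 AND REPLACE(LTRIM(RTRIM(c.CompanyName)), '  ', ' ')
   = REPLACE(LTRIM(RTRIM(k.CompanyName)), '  ', ' ');

-- Stage 2: phonetic matches, only for rows not caught by stage 1.
INSERT INTO @Staging (SourceID, OriginalName, ProposedName, MatchStage)
SELECT c.ID, c.CompanyName, k.CompanyName, 'phonetic'
FROM @Company AS c
JOIN @Company AS k
  ON k.ID < c.ID
 AND c.CompanyName <> k.CompanyName
 AND DIFFERENCE(c.CompanyName, k.CompanyName) >= 3
WHERE NOT EXISTS (SELECT 1 FROM @Staging AS s WHERE s.SourceID = c.ID);

-- Review step: a person inspects this list and sets Approved = 1 where correct.
SELECT * FROM @Staging;

-- Only approved proposals are applied, using the stored key.
UPDATE c
SET c.CompanyName = s.ProposedName
FROM @Company AS c
JOIN @Staging AS s ON s.SourceID = c.ID
WHERE s.Approved = 1;
```

As written, the final UPDATE changes nothing because no row has been approved yet. That is the point of the staging area.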

If you want to get really smart, you can use the reviewed data as a repository to make the package "learn". For example, "iu" is hardly ever valid in English, so if the package detects "iu" and changing it to "ui" produces a valid English word, that is a correction you might want to start applying automatically at some point.
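One way to picture such a learned rule is a small rule table whose rows generate proposals for review. The @Rule table, its columns and the confirmation counter are assumptions made for this sketch. Note that some ordinary words, such as "Premium", do contain "iu", so a rule like this probably needs the review step for a while before anything is applied automatically.

```sql
-- Hypothetical company names.
DECLARE @Company TABLE (ID INT, CompanyName VARCHAR(100));
INSERT INTO @Company (ID, CompanyName)
VALUES (1, 'Biulding Supplies'),
       (2, 'Premium Roads'),
       (3, 'Building Supplies');

-- Hypothetical rule repository, filled from corrections that reviewers confirmed.
DECLARE @Rule TABLE (
    Pattern        VARCHAR(10),
    Replacement    VARCHAR(10),
    TimesConfirmed INT          -- how often reviewers accepted this rule
);
INSERT INTO @Rule (Pattern, Replacement, TimesConfirmed)
VALUES ('iu', 'ui', 0);

-- Generate proposals for review; nothing is updated here.
SELECT c.ID,
       c.CompanyName AS OriginalName,
       REPLACE(c.CompanyName, r.Pattern, r.Replacement) AS ProposedName,
       r.TimesConfirmed
FROM @Company AS c
JOIN @Rule AS r
  ON c.CompanyName LIKE '%' + r.Pattern + '%';
```

Both 'Biulding Supplies' and 'Premium Roads' match the pattern here, which shows why the proposals are sent to review rather than applied directly.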

Another thing to keep in mind is to maintain a list of all confirmed names. You can use it to check new data for duplicates of those names, and to avoid unnecessary rechecking and load when processing the source data.
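A minimal sketch of that check, assuming a hypothetical @Confirmed table of names that have already been reviewed: names that are already confirmed are skipped, and only the remaining ones are compared against the confirmed list.

```sql
-- Hypothetical list of names that have already been reviewed and confirmed.
DECLARE @Confirmed TABLE (CompanyName VARCHAR(100) PRIMARY KEY);
INSERT INTO @Confirmed (CompanyName)
VALUES ('Building Supplies'),
       ('Highway Supplies');

-- Hypothetical incoming source data.
DECLARE @Company TABLE (ID INT, CompanyName VARCHAR(100));
INSERT INTO @Company (ID, CompanyName)
VALUES (1, 'Building Supplies'),   -- already confirmed, skipped below
       (2, 'Buidings Supplies'),
       (3, 'Road Contractor');

-- Compare only unconfirmed names against the confirmed list.
SELECT c.ID,
       c.CompanyName,
       k.CompanyName AS ConfirmedCandidate,
       DIFFERENCE(c.CompanyName, k.CompanyName) AS DIF
FROM @Company AS c
JOIN @Confirmed AS k
  ON DIFFERENCE(c.CompanyName, k.CompanyName) >= 3
WHERE NOT EXISTS (SELECT 1
                  FROM @Confirmed AS x
                  WHERE x.CompanyName = c.CompanyName);
```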


Source: https://habr.com/ru/post/1205832/

