T-SQL Soundex / Difference for finding duplicate rows

I have a legacy database with the columns: first name, last name, address1, address2, address3, address4, zip code. The data is spread across these columns inconsistently; for example, the actual zip code can be in any column, and there are many typos.

Is there a way to use something like SOUNDEX / DIFFERENCE in a stored procedure to iterate over everything and return an ordered list of likely duplicates? [It doesn't need to be fast.]

+3
2 answers

If you are using SQL Server 2005 or later, you can use fuzzy matching in SSIS (the Fuzzy Grouping transformation) for this task. I have found that it gives significantly better results than searching for soundex matches or writing my own SQL code to find close matches.
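For comparison, the plain T-SQL route the question asks about could be sketched roughly as below. This is only an illustration under assumptions: the table name tableName and the columns id, fName and lName are hypothetical, and the threshold of 3 is an arbitrary starting point. DIFFERENCE returns 0 to 4, where 4 means the two SOUNDEX codes match most closely.

```sql
-- Sketch only: tableName, id, fName, lName are assumed names, not from the question.
-- Compares every row with every later row, so it is O(n^2) and slow on large tables.
select   a.id                          as id,
         b.id                          as candidateId,
         a.fName, a.lName,
         b.fName                       as candidateFName,
         b.lName                       as candidateLName,
         difference(a.fName, b.fName)
       + difference(a.lName, b.lName)  as score      -- 0 (no match) to 8 (best)
from     tableName a
join     tableName b
on       a.id < b.id                                 -- each pair once, never a row with itself
where    difference(a.lName, b.lName) >= 3           -- assumed threshold, tune as needed
and      difference(a.fName, b.fName) >= 3
order by score desc, a.id, b.id;
```

SOUNDEX only encodes how a word starts and sounds, so this kind of query tends to miss typos in the first letter and says nothing about values that sit in the wrong column.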

+3

If you simply want to find probable duplicates, checksum / binary_checksum will give you a good indication, although it is only a 32-bit hash, so depending on the size of your data set you may get a few false positives. checksum() is case-insensitive (under a case-insensitive collation), while binary_checksum() is case-sensitive. This gives you the 32-bit hash for each row in your table:

select   checksum(*), binary_checksum(*)
from     tableName;
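The case-sensitivity difference can be checked with a small query like the one below. The literals and the explicit collation are assumptions chosen for the demo. Under a case-insensitive collation the two checksum() columns should come back equal, while the two binary_checksum() columns should differ.

```sql
-- Demo literals only; Latin1_General_CI_AS is an assumed case-insensitive collation.
select checksum('McCavity' collate Latin1_General_CI_AS) as checksumMixed,
       checksum('MCCAVITY' collate Latin1_General_CI_AS) as checksumUpper,
       binary_checksum('McCavity')                       as binaryMixed,
       binary_checksum('MCCAVITY')                       as binaryUpper;
```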

This returns the IDs of the rows that match (i.e. they are duplicates and have different IDs):

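-- checksum(*) cannot be written as a.checksum(*), so compute the hash in a derived table.
-- Note: checksum(*) hashes every column, id included; with a unique id, list only the data columns.
select   a.id, b.id as duplicateId
from     (select id, checksum(*) as rowHash from tableName) a
join     (select id, checksum(*) as rowHash from tableName) b
on       a.rowHash = b.rowHash
and      a.id <> b.id;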

If you only want to find probable duplicates on some of the columns, say fName, lName, address .., you would use:

checksum(a.fName, a.lName, a.address, ...)

in place of the (*).
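Putting the pieces together, a complete query could look something like the sketch below. It assumes the table has exactly the columns id, fName, lName and address used in the snippets above (the real column list is not given), trims surrounding spaces before hashing, and re-checks the real values because a 32-bit hash can collide. The common table expression needs SQL Server 2005 or later.

```sql
-- Sketch under assumptions: tableName(id, fName, lName, address) is illustrative.
with hashed as (
    select id,
           ltrim(rtrim(fName))   as fName,
           ltrim(rtrim(lName))   as lName,
           ltrim(rtrim(address)) as address,
           -- hash only the data columns, never the id
           checksum(ltrim(rtrim(fName)),
                    ltrim(rtrim(lName)),
                    ltrim(rtrim(address))) as rowHash
    from   tableName
)
select   a.id as id, b.id as duplicateId, a.fName, a.lName, a.address
from     hashed a
join     hashed b
on       a.rowHash = b.rowHash     -- cheap first filter
and      a.id < b.id               -- report each pair once
where    a.fName   = b.fName       -- confirm the match, guarding against hash collisions
and      a.lName   = b.lName
and      a.address = b.address
order by a.id, b.id;
```

Rows with NULL in any compared column are dropped by the equality checks, and this only finds exact matches (ignoring case under a case-insensitive collation), not typos.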

+1

Source: https://habr.com/ru/post/1725184/