Deduplicating database records by comparing values across multiple fields

So, I am trying to clean up phone records in a database table.

I learned how to find exact matches in two fields using:

    /* DUPLICATE first & last names */
    SELECT `First Name`, `Last Name`, COUNT(*) c
    FROM phone.contacts
    GROUP BY `Last Name`, `First Name`
    HAVING c > 1;

Wow, great.

I want to extend this to look at several fields, to see whether the phone number in any one of the 3 phone fields is a duplicate.

So, I want to check 3 fields (general mobile, general phone, business phone):

1. see that they are not empty ('')
2. see whether the number appears in either of the other two phone fields anywhere in the table.

So, pushing my limited SQL past its limits, I came up with the following, which seems to return records with all three phone fields empty, as well as records that have no duplicate phone numbers.

    /* DUPLICATE general & business phone nos */
    SELECT id, `first name`, `last name`, `general mobile`, `general phone`,
           `general email`, `business phone`,
           COUNT(CASE WHEN `general mobile` <> '' THEN 1 ELSE NULL END) AS gen_mob,
           COUNT(CASE WHEN `general phone` <> '' THEN 1 ELSE NULL END) AS gen_phone,
           COUNT(CASE WHEN `business phone` <> '' THEN 1 ELSE NULL END) AS bus_phone
    FROM phone.contacts
    GROUP BY `general mobile`, `general phone`, `business phone`
    HAVING gen_mob > 1 OR gen_phone > 1 OR bus_phone > 1;

My logic is clearly wrong, and I wonder if anyone could point me in the right direction, have mercy, etc.

Thank you very much

+6
3 answers

The first thing you need to do is shoot the person who named your columns with spaces in them.

Now try the following:

    SELECT DISTINCT c.id, c.`first name`, c.`last name`,
           c.`general mobile`, c.`general phone`, c.`business phone`
    FROM contacts_test c
    JOIN contacts_test c2
      ON (c.`general mobile` != '' AND c.`general mobile` IN (c2.`general phone`, c2.`business phone`))
      OR (c.`general phone` != '' AND c.`general phone` IN (c2.`general mobile`, c2.`business phone`))
      OR (c.`business phone` != '' AND c.`business phone` IN (c2.`general mobile`, c2.`general phone`));

See a live demo of this query in SQLFiddle.

Note the additional check for phone != '', which is required because the phone columns are not nullable, so their "unknown" value is the empty string. Without this check, false matches are returned, because, of course, blank equals blank.

The DISTINCT keyword was added in case several other rows match, which would otherwise produce an n x n result set.
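To see why the blank check matters, here is a minimal sketch that runs the same join shape against an in-memory SQLite database from Python. The three sample rows and the helper name `duplicate_ids` are made up for illustration, and SQLite stands in for MySQL (it accepts the same backtick-quoted column names).

```python
import sqlite3

PHONE_COLS = ["general mobile", "general phone", "business phone"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts_test ("
    " id INTEGER PRIMARY KEY,"
    " `general mobile` TEXT NOT NULL DEFAULT '',"
    " `general phone` TEXT NOT NULL DEFAULT '',"
    " `business phone` TEXT NOT NULL DEFAULT '')"
)
# Hypothetical sample data: ids 1 and 2 share a number, id 3 is unique.
conn.executemany(
    "INSERT INTO contacts_test VALUES (?, ?, ?, ?)",
    [
        (1, "555-0100", "", ""),
        (2, "", "555-0199", "555-0100"),
        (3, "555-0123", "", ""),
    ],
)

def duplicate_ids(conn, skip_blanks):
    """Return ids whose number in one phone field appears in one of the
    other two phone fields of any row (same join shape as the answer)."""
    clauses = []
    for col in PHONE_COLS:
        others = ", ".join(f"c2.`{o}`" for o in PHONE_COLS if o != col)
        guard = f"c.`{col}` != '' AND " if skip_blanks else ""
        clauses.append(f"({guard}c.`{col}` IN ({others}))")
    sql = (
        "SELECT DISTINCT c.id FROM contacts_test c "
        "JOIN contacts_test c2 ON " + " OR ".join(clauses) + " ORDER BY c.id"
    )
    return [row[0] for row in conn.execute(sql)]

with_check = duplicate_ids(conn, skip_blanks=True)      # [1, 2]
without_check = duplicate_ids(conn, skip_blanks=False)  # [1, 2, 3]
print(with_check, without_check)
```

Without the guard, id 3 is reported only because its empty `general phone` equals another row's empty `business phone`.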

+5

In my experience, when cleaning data, it is much better to have a clear view of the data and an easy way to manage it than to have a large and cumbersome query that does the whole analysis at once.

You can also (more or less) renormalize the database using something like:

    CREATE VIEW VContactsWithPhones AS
    SELECT id, `Last Name` AS LastName, `First Name` AS FirstName,
           `General Mobile` AS Phone, 'General Mobile' AS PhoneType
    FROM phone.contacts c
    UNION
    SELECT id, `Last Name`, `First Name`, `General Phone`, 'General Phone'
    FROM phone.contacts c
    UNION
    SELECT id, `Last Name`, `First Name`, `Business Phone`, 'Business Phone'
    FROM phone.contacts c;

This creates a view with three times as many rows as the source table, but with a single Phone column, whose PhoneType can be one of three types.
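As a rough illustration of the row tripling, the sketch below builds the same kind of view in an in-memory SQLite database from Python. The two contacts are invented, and the table is created as plain `contacts` rather than `phone.contacts`, which is an assumption made only to keep the example self-contained.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts (
    id INTEGER PRIMARY KEY,
    `First Name` TEXT, `Last Name` TEXT,
    `General Mobile` TEXT, `General Phone` TEXT, `Business Phone` TEXT
);
-- Hypothetical sample rows.
INSERT INTO contacts VALUES
    (1, 'Ann', 'Lee', '555-0100', '', '555-0111'),
    (2, 'Bob', 'Ray', '', '555-0100', '');

-- One output row per (contact, phone field) pair.
CREATE VIEW VContactsWithPhones AS
    SELECT id, `Last Name` AS LastName, `First Name` AS FirstName,
           `General Mobile` AS Phone, 'General Mobile' AS PhoneType
    FROM contacts
    UNION
    SELECT id, `Last Name`, `First Name`, `General Phone`, 'General Phone'
    FROM contacts
    UNION
    SELECT id, `Last Name`, `First Name`, `Business Phone`, 'Business Phone'
    FROM contacts;
""")

source_rows = conn.execute("SELECT COUNT(*) FROM contacts").fetchone()[0]
view_rows = conn.execute("SELECT COUNT(*) FROM VContactsWithPhones").fetchone()[0]
print(source_rows, view_rows)  # 2 6

# The three rows the view produces for contact 1, one per phone field.
ann = conn.execute(
    "SELECT Phone, PhoneType FROM VContactsWithPhones "
    "WHERE id = 1 ORDER BY PhoneType"
).fetchall()
print(ann)
```

Empty phone fields still produce rows in the view, so the queries against it need to filter them out.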

You can then easily query this view:

    -- empty phones
    SELECT * FROM VContactsWithPhones
    WHERE Phone IS NULL OR Phone = '';

    -- duplicate phones
    SELECT Phone, COUNT(*)
    FROM VContactsWithPhones
    WHERE Phone IS NOT NULL AND Phone <> ''  -- exclude empty values
    GROUP BY Phone
    HAVING COUNT(*) > 1;

    -- duplicate phones belonging to the same ID (double entries)
    SELECT Phone, ID, COUNT(*)
    FROM VContactsWithPhones
    WHERE Phone IS NOT NULL AND Phone <> ''  -- exclude empty values
    GROUP BY Phone, ID
    HAVING COUNT(*) > 1;

    -- duplicate phones belonging to different IDs (duplicate entries)
    SELECT v1.Phone, v1.ID, v1.PhoneType, v2.ID, v2.PhoneType
    FROM VContactsWithPhones v1
    INNER JOIN VContactsWithPhones v2
        ON v1.Phone = v2.Phone AND v1.ID <> v2.ID  -- different records
    WHERE v1.Phone IS NOT NULL AND v1.Phone <> '';

etc. etc.
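A small sketch of how such queries behave on sample data, run through Python's sqlite3 module. To keep it short, a plain table with the same name and the columns `id`, `Phone`, `PhoneType` stands in for the view; the rows are invented. The cross-record query here compares ids with `<` so that each pair of different records is listed once rather than twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Stand-in for the view: one row per (contact id, phone field).
conn.execute("CREATE TABLE VContactsWithPhones (id INTEGER, Phone TEXT, PhoneType TEXT)")
conn.executemany(
    "INSERT INTO VContactsWithPhones VALUES (?, ?, ?)",
    [
        (1, "555-0100", "General Mobile"), (1, "", "General Phone"),
        (1, "555-0111", "Business Phone"),
        (2, "", "General Mobile"), (2, "555-0100", "General Phone"),
        (2, "", "Business Phone"),
        (3, "555-0122", "General Mobile"), (3, "555-0122", "General Phone"),
        (3, None, "Business Phone"),
    ],
)

NOT_EMPTY = "Phone IS NOT NULL AND Phone <> ''"

# Numbers that occur more than once anywhere.
duplicate_phones = conn.execute(
    f"SELECT Phone, COUNT(*) FROM VContactsWithPhones WHERE {NOT_EMPTY} "
    "GROUP BY Phone HAVING COUNT(*) > 1 ORDER BY Phone"
).fetchall()

# Numbers entered twice on the same record.
same_record = conn.execute(
    f"SELECT Phone, id, COUNT(*) FROM VContactsWithPhones WHERE {NOT_EMPTY} "
    "GROUP BY Phone, id HAVING COUNT(*) > 1"
).fetchall()

# Numbers shared by two different records (each pair listed once).
across_records = conn.execute(
    "SELECT v1.Phone, v1.id, v1.PhoneType, v2.id, v2.PhoneType "
    "FROM VContactsWithPhones v1 "
    "JOIN VContactsWithPhones v2 ON v1.Phone = v2.Phone AND v1.id < v2.id "
    "WHERE v1.Phone IS NOT NULL AND v1.Phone <> ''"
).fetchall()

print(duplicate_phones)  # [('555-0100', 2), ('555-0122', 2)]
print(same_record)       # [('555-0122', 3, 2)]
print(across_records)    # [('555-0100', 1, 'General Mobile', 2, 'General Phone')]
```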

+1

You can try something like:

    SELECT *
    FROM phone.contacts p
    WHERE `general mobile` <> ''  -- skip blank numbers
      AND `general mobile` IN (
        SELECT `general mobile` FROM phone.contacts WHERE id != p.id
        UNION
        SELECT `general phone` FROM phone.contacts WHERE id != p.id
        UNION
        SELECT `business phone` FROM phone.contacts WHERE id != p.id
    );

Repeat this once for each of the three fields: general mobile, general phone and business phone. It can all be put into a single query, but that will be less readable.
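One way to avoid writing the query three times by hand is to generate it per column. The sketch below does that from Python against an in-memory SQLite database; it assumes the three phone columns from the question, adds a blank-number guard, and uses invented rows and a hypothetical helper name `ids_sharing`.

```python
import sqlite3

PHONE_COLS = ["general mobile", "general phone", "business phone"]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contacts (id INTEGER PRIMARY KEY,"
    " `general mobile` TEXT, `general phone` TEXT, `business phone` TEXT)"
)
# Hypothetical sample data.
conn.executemany(
    "INSERT INTO contacts VALUES (?, ?, ?, ?)",
    [
        (1, "555-0100", "", ""),
        (2, "", "555-0100", ""),          # shares a number with id 1
        (3, "555-0133", "", "555-0133"),  # repeated only within its own row
        (4, "555-0144", "", ""),
    ],
)

def ids_sharing(conn, col):
    """Ids whose non-blank value in `col` appears in any phone
    column of a different row."""
    union = " UNION ".join(
        f"SELECT `{c}` FROM contacts WHERE id != p.id" for c in PHONE_COLS
    )
    sql = (
        f"SELECT p.id FROM contacts p "
        f"WHERE p.`{col}` <> '' AND p.`{col}` IN ({union}) ORDER BY p.id"
    )
    return [row[0] for row in conn.execute(sql)]

results = {col: ids_sharing(conn, col) for col in PHONE_COLS}
print(results)
# {'general mobile': [1], 'general phone': [2], 'business phone': []}
```

Because of the `id != p.id` condition, a number repeated only within a single record (id 3 here) is not reported by this approach.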

0

Source: https://habr.com/ru/post/951795/

