MySQL: Eliminate duplicate rows without violating a foreign key constraint

Question

MySQL: Eliminate duplicate rows without violating a foreign key constraint

I have a customer database populated with normalized addresses. There are duplicates.

Each user created their own entry and entered their own address. Thus, we have a 1 to 1 relationship between users and addresses:

CREATE TABLE `users` ( `UserID` INT UNSIGNED NOT NULL AUTO_INCREMENT, `Name` VARCHAR(63), `Email` VARCHAR(63), `AddressID` INT UNSIGNED, PRIMARY KEY (`UserID`) USING BTREE ) ENGINE=InnoDB DEFAULT CHARSET=utf8; CREATE TABLE `addresses` ( `AddressID` INT UNSIGNED NOT NULL AUTO_INCREMENT, `Duplicate` VARCHAR(1), `Address1` VARCHAR(63) DEFAULT NULL, `Address2` VARCHAR(63) DEFAULT NULL, `City` VARCHAR(63) DEFAULT NULL, `State` VARCHAR(2) DEFAULT NULL, `ZIP` VARCHAR(10) DEFAULT NULL, PRIMARY KEY (`AddressID`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8;

And the data:

 INSERT INTO `users` VALUES (1, 'Michael', ' michael@email.com ', 1), (2, 'Steve', ' steve@email.com ', 2), (3, 'Judy', ' judy@email.com ', 3), (4, 'Kathy', ' kathy@email.com ', 4), (5, 'Mark', ' mark@email.com ', 5), (6, 'Robert', ' robert@email.com ', 6), (7, 'Susan', ' susan@email.com ', 7), (8, 'Paul', ' paul@email.com ', 8), (9, 'Patrick', ' patrick@email.com ', 9), (10, 'Mary', ' mary@email.com ', 10), (11, 'James', ' james@email.com ', 11), (12, 'Barbara', ' barbara@email.com ', 12), (13, 'Peter', ' peter@email.com ', 13); INSERT INTO `addresses` VALUES (1, '', '1234 Main Street', '', 'Springfield', 'KS', '54321'), (2, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'), (3, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'), (4, '', '5678 Sycamore Lane', '', 'Upstate', 'NY', '50000'), (5, '', '1000 State Street', 'Apt C', 'Sunnydale', 'OH', '54321'), (6, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'), (7, 'Y', '1000 State Street', 'Apt C', 'Sunnydale', 'OH', '54321'), (8, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'), (9, '', '1000 State Street', 'Apt A', 'Sunnydale', 'OH', '54321'), (10, 'Y', '1234 Main Street', '', 'Springfield', 'KS', '54321'), (11, 'Y', '5678 Sycamore Lane', '', 'Upstate', 'NY', '50000'), (12, 'Y', '1000 Main Street', 'Apt A', 'Sunnydale', 'OH', '54321'), (13, '', '9999 Valleyview', '', 'Springfield', 'KS', '54321');

Oh yes, let me add that relation of foreign keys:

 ALTER TABLE `users` ADD CONSTRAINT `AddressID` FOREIGN KEY `AddressID` (`AddressID`) REFERENCES `addresses` (`AddressID`);

We had a list of addresses cleared by a third-party service that normalized the data and indicated where we had duplicates. This is where the Duplicate column came from. If there is a "Y", this is a duplicate of another address. The primary address is NOT marked as duplicate, as shown in the sample data.

I obviously want to delete all duplicate entries, but there are user entries that point to them. I need them to point to a version of the address that is not a duplicate.

So how can I update the AddressID in users to match non-dual addresses?

The only way I can do this is to repeat all the data using a high-level language, but I am sure that MySQL has all the tools necessary to make something like this better.

Here is what I tried:

 SELECT COUNT(*) as cnt, GROUP_CONCAT(AddressID ORDER BY AddressID) AS ids FROM addresses GROUP BY Address1, Address2, City, State, ZIP HAVING cnt > 1; +-----+--------------+ | cnt | ids | +-----+--------------+ | 2 | 5,7 | | 6 | 1,2,3,6,8,10 | | 2 | 4,11 | +-----+--------------+ 3 rows in set (0.00 sec)

From there, I could iterate over each line of the result and do this:

 UPDATE `users` SET `AddressID` = 1 WHERE `AddressID` IN (2,3,6,8,10);

But there must be a better way only for MySQL, shouldn't it?

When all is said and done, the data MUST look like this:

 SELECT * FROM `users`; +--------+---------+-------------------+-----------+ | UserID | Name | Email | AddressID | +--------+---------+-------------------+-----------+ | 1 | Michael | michael@email.com | 1 | | 2 | Steve | steve@email.com | 1 | | 3 | Judy | judy@email.com | 1 | | 4 | Kathy | kathy@email.com | 4 | | 5 | Mark | mark@email.com | 5 | | 6 | Robert | robert@email.com | 1 | | 7 | Susan | susan@email.com | 5 | | 8 | Paul | paul@email.com | 1 | | 9 | Patrick | patrick@email.com | 9 | | 10 | Mary | mary@email.com | 1 | | 11 | James | james@email.com | 4 | | 12 | Barbara | barbara@email.com | 1 | | 13 | Peter | peter@email.com | 13 | +--------+---------+-------------------+-----------+ 13 rows in set (0.00 sec) SELECT * FROM `addresses`; +-----------+-----------+--------------------+----------+-------------+-------+-------+ | AddressID | Duplicate | Address1 | Address2 | City | State | ZIP | +-----------+-----------+--------------------+----------+-------------+-------+-------+ | 1 | | 1234 Main Street | | Springfield | KS | 54321 | | 4 | | 5678 Sycamore Lane | | Upstate | NY | 50000 | | 5 | | 1000 State Street | Apt C | Sunnydale | OH | 54321 | | 9 | | 1000 State Street | Apt A | Sunnydale | OH | 54321 | | 13 | | 9999 Valleyview | | Springfield | KS | 54321 | +-----------+-----------+--------------------+----------+-------------+-------+-------+ 5 rows in set (0.00 sec)

reference

+6

sql php mysql duplicate-removal normalization

pbarney Nov 27 '13 at 2:28

source share

2 answers

You have many, one relationships between users and addresses (that is, multiple users can map the same address). It seems a little strange to me, but I think it might be useful. Many-to-many will make more sense, that is, a user can have several addresses, but the same address can be used by several users. Typically, a single user has multiple addresses. Updating your circuit might help, but I digress.

 UPDATE users -- We only care about users mapped to duplicate addresses JOIN addresses dupe ON (users.AddressID = dupe.AddressID AND dupe.Duplicate='Y') -- If your normalizer thingy worked right, these will all map to non-duplicates JOIN addresses nondupe ON (dupe.Address1 = nondupe.Address1 -- Compare to other columns if you want AND nondupe.Duplicate = '') -- Set to the nondupe ID SET users.AddressID = nondupe.AddressID;

http://sqlfiddle.com/#!2/5d303/1

+2

Explosion pills Nov 27 '13 at 2:42

source share

Webchemist · Accepted Answer · 2013-11-27T03:16:05+0000

To select the results you want to see:

 SELECT a.UserID ,a.Name ,a.Email ,( SELECT addressID FROM addresses c WHERE c.Address1 = b.Address1 AND c.Address2 = b.Address2 AND c.City = b.City AND c.State = b.State AND c.ZIP = b.ZIP AND DUPLICATE != 'Y' ) as AddressID FROM users a JOIN addresses b ON a.AddressID = b.AddressID

This will update the user table to the results specified in the query above.

 UPDATE users a JOIN addresses b ON a.AddressID = b.AddressID SET a.addressID = ( SELECT addressID FROM addresses c WHERE c.Address1 = b.Address1 AND c.Address2 = b.Address2 AND c.City = b.City AND c.State = b.State AND c.ZIP = b.ZIP AND Duplicate != 'Y' ) WHERE Duplicate = 'Y'

Please note that with sample data No. 12 you provided, the Barbara identifier is null in the SELECT query, since its address is marked as a duplicate, if in fact it is unique to the list provided. It does not match address 1, as indicated in the results of "how it should look."

Edit

To handle invalid duplicate flags, such as # 12 Barbara, or possibly other missing duplicates that were not flagged as such, you can skip checking the duplicate flag column and just use ORDER BY and LIMIT in the subquery so that it returns the first smallest identifier match addresses, regardless of duplicate character:

 UPDATE users a JOIN addresses b ON a.AddressID = b.AddressID SET a.addressID = ( SELECT addressID FROM addresses c WHERE c.Address1 = b.Address1 AND c.Address2 = b.Address2 AND c.City = b.City AND c.State = b.State AND c.ZIP = b.ZIP ORDER BY c.addressID ASC LIMIT 1 )

MySQL: Eliminate duplicate rows without violating a foreign key constraint

More articles: