In fact, the issue with its current limitations is a rather difficult task. I have been thinking about the solution all evening (realizing that the solution will never be useful). I would not use the solution in the wild, I just tried to figure out whether it is possible to do this only using MySQL.
The question is in my wording: is it possible to write a series of DELETE statements that delete duplicate rows from a table with two columns without unique restrictions?
Problems:
- Lines
- do not have an identification key or primary key, so you should think of a way to refer to one line, which should remain
- we need to somehow group the lines, that is, apply the order and then the condition, but the
DELETE form, which supports ORDER BY , can only have a WHERE and does not support HAVING . This order applies after the condition is met. - we donโt need to sort the rows if the values โโare ordered using the cluster primary key, but we donโt have it.
Suppose we have a table:
CREATE TABLE `tablename` ( `a_id` int(10) unsigned NOT NULL, `b_id` int(10) unsigned NOT NULL, KEY `Index_1` (`a_id`,`b_id`) ) ENGINE=InnoDB COLLATE utf8_bin;
I added a key (not UNIQUE or PRIMARY) to speed up the search and hope to use it in groups.
You can copy a table with some values:
INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4); INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4); INSERT INTO tablename (a_id, b_id) VALUES (2, 3), (1, 1), (2, 2), (1,4);
As a side effect, the key has become a coverage index, and when we make SELECT from the table, the displayed values โโare sorted, but when we make exceptions, the values โโare read in the order in which we insert them.
Now consider the following query:
SELECT @c, @a_id as a, @b_id as b, a_id, b_id FROM tablename, (SELECT @a_id:=0, @b_id:=0, @c:=0) as init WHERE (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) >= 1 ;
And its result:
@c, a, b, a_id, b_id 1, 1, 1, 1, 1 2, 1, 1, 1, 1 3, 1, 1, 1, 1 1, 1, 4, 1, 4 2, 1, 4, 1, 4 3, 1, 4, 1, 4 1, 2, 2, 2, 2 2, 2, 2, 2, 2 3, 2, 2, 2, 2 1, 2, 3, 2, 3 2, 2, 3, 2, 3 3, 2, 3, 2, 3
Results are automatically sorted using Index_1 , and duplicate pairs (a_id, b_id) are listed in the @c column. Now our task is to delete all the lines where @c > 1 . The only problem we have is getting MySQL to use Index_1 when deleting, which is quite difficult without additional conditions. But we can do this using an equality check or several equality checks on a_id :
DELETE FROM t USING tablename t FORCE INDEX (Index_1) JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init WHERE a_id IN (1) AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) > 1; DELETE FROM t USING tablename t FORCE INDEX (Index_1) JOIN (SELECT @a_id:=0, @b_id:=0, @c:=0) as init WHERE a_id IN (2) AND (@c:=IF(LEAST(@a_id=(@a_id:=a_id), @b_id=(@b_id:=b_id)), @c+1, 1)) > 1; SELECT * FROM tablename t; a_id, b_id 1, 1 1, 4 2, 2 2, 3
I cannot put all possible a_id into IN() , because MySQL will understand that the index is useless in this case, and the query will not delete all duplicates (only adjacent ones), but, say, 10 different a_id I can delete duplicates in two DELETE statements , each IN will have 5 explicit identifiers.
Hope this can be useful to someone =)