Select all rows containing duplicate values in one of two columns from separate groups of related records

Question

I am trying to create a MySQL query that will return all individual rows (not grouped) containing duplicate values from a group of related records. By "groups of related records", I mean those with the same account number (as shown below).

Essentially, within each group of related records that share the same account number, select only those rows whose date or amount value matches that of another row in the same account group. Values should be considered duplicates only within that account group. The sample table and the ideal output below should clarify the situation.

Also, I'm not interested in returning records with status X, even if they have duplicate values.

A small sample table with relevant data:

    id  account  invoice  date        amount  status
     1        1        1  2012-04-01       0  X
     2        1        2  2012-04-01     120  P
     3        1        2  2012-05-01     120  U
     4        1        3  2012-05-01     117  U
     5        2        4  2012-04-01      82  X
     6        2        4  2012-05-01      82  U
     7        2        5  2012-03-01      81  P
     8        2        6  2012-05-01      80  U
     9        3        7  2012-03-01      80  P
    10        3        8  2012-04-01      79  U
    11        3        9  2012-04-01      78  U

The ideal output returned by the required SQL query:

    id  account  invoice  date        amount  status
     2        1        2  2012-04-01     120  P
     3        1        2  2012-05-01     120  U
     4        1        3  2012-05-01     117  U
     6        2        4  2012-05-01      82  U
     8        2        6  2012-05-01      80  U
    10        3        8  2012-04-01      79  U
    11        3        9  2012-04-01      78  U

Therefore, the pairs 7/9 and 8/9 should not count as matches, because the values they share are not duplicates within the same account, so rows 7 and 9 should not be returned. However, row 8 must be returned because it shares a date with row 6 in the same account.

Later, I may want to refine the selection further by keeping only duplicate rows that also have matching statuses, so row 2 would be excluded because its status does not match the other two rows found in that account group. How much harder would that make the query? Is it just a matter of adding a WHERE or HAVING clause, or is it more complicated?

Hopefully my explanation of what I'm trying to do makes sense. I tried using INNER JOIN, but it returns every desired row more than once. I do not want duplicate duplicates.

Table structure and sample values:

    CREATE TABLE payment (
        id int(11) NOT NULL auto_increment,
        account int(10) NOT NULL default '0',
        invoice int(10) NOT NULL default '0',
        date date NOT NULL default '0000-00-00',
        amount int(10) NOT NULL default '0',
        status char(1) NOT NULL default '',
        PRIMARY KEY (id)
    );

    INSERT INTO payment VALUES (1, 1, 1, '2012-04-01', 0, 'X');
    INSERT INTO payment VALUES (2, 1, 2, '2012-04-01', 120, 'P');
    INSERT INTO payment VALUES (3, 1, 2, '2012-05-01', 120, 'U');
    INSERT INTO payment VALUES (4, 1, 3, '2012-05-01', 117, 'U');
    INSERT INTO payment VALUES (5, 2, 4, '2012-04-01', 82, 'X');
    INSERT INTO payment VALUES (6, 2, 4, '2012-05-01', 82, 'U');
    INSERT INTO payment VALUES (7, 2, 5, '2012-03-01', 81, 'p');
    INSERT INTO payment VALUES (8, 2, 6, '2012-05-01', 80, 'U');
    INSERT INTO payment VALUES (9, 3, 7, '2012-03-01', 80, 'U');
    INSERT INTO payment VALUES (10, 3, 8, '2012-04-01', 79, 'U');
    INSERT INTO payment VALUES (11, 3, 9, '2012-04-01', 78, 'U');

inner-join mysql duplicates having group-by

purefusion May 03 '12 at 13:29

2 answers

This seems to work:

    select *
    from payment p1
    join payment p2
        on (p1.id != p2.id
            and p1.status != 'X'
            and p1.account = p2.account
            and (p1.amount = p2.amount or p1.date = p2.date))
    group by p1.id

http://sqlfiddle.com/#!2/a50e9/3

goat May 03 '12 at 14:07

Matt fenwick · Accepted Answer · 2012-05-03T13:53:50+0000

This type of query can be implemented as a semi-join.

Semi-joins are used to select rows from only one of the tables in a join.

For instance:

    select distinct l.*
    from payment l
    inner join payment r
        on  l.id != r.id
        and l.account = r.account
        and (l.date = r.date or l.amount = r.amount)
    where l.status != 'X'
      and r.status != 'X'
    order by l.id asc;

Note the use of distinct and that I select columns only from the left table. This ensures no duplicates.

The join condition verifies that:

a row is not joined to itself (l.id != r.id)
the rows are in the same account (l.account = r.account)
either the date or the amount is the same (l.date = r.date or l.amount = r.amount)
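
The same conditions can also be expressed with EXISTS, which returns each left-hand row at most once and so does not need distinct. This is only a sketch of an alternative formulation against the payment table from the question, not the query used in this answer:

```sql
-- Semi-join via EXISTS: keep a row l if at least one other non-X row
-- in the same account shares its date or its amount.
select l.*
from payment l
where l.status != 'X'
  and exists (
        select 1
        from payment r
        where r.id != l.id
          and r.account = l.account
          and r.status != 'X'
          and (r.date = l.date or r.amount = l.amount)
      )
order by l.id asc;
```

On the sample data from the question this should return ids 2, 3, 4, 6, 8, 10 and 11, which matches the ideal output.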

For the second part of your question, you will need to update the on clause of the query.
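
As a sketch of that change, assuming "corresponding statuses" means the two duplicate rows must have the same status, one extra predicate in the on clause may be enough:

```sql
-- Same self-join as in the query above, with one added condition:
-- the matching row must also have the same status
-- (an assumed reading of the question, not confirmed by the asker).
select distinct l.*
from payment l
inner join payment r
    on  l.id != r.id
    and l.account = r.account
    and l.status = r.status
    and (l.date = r.date or l.amount = r.amount)
where l.status != 'X'
order by l.id asc;
```

Because the statuses must now be equal, filtering l.status != 'X' also rules out X rows on the right-hand side, so the separate r.status check is no longer needed. On the sample data, row 2 drops out and ids 3, 4, 6, 8, 10 and 11 remain.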
