One thing to consider when using Redshift is that deleted records are only "soft" deleted until VACUUM is run:
- They remain in the table, marked as to-be-ignored
- They are only physically removed when a VACUUM runs
However, VACUUM on a large table with deletes scattered throughout it is very often actually slower than a Deep Copy. (Duplicate the data into another table, using GROUP BY or DISTINCT to eliminate the duplicates, then either TRUNCATE the original table and re-insert the data, or drop the original table and rename the new one.)
That is a general rationale for why what seems like a "slow" process can actually be the better option.
In addition, if two rows really are identical, then there is (by definition) no way to uniquely identify one of them. In that case you cannot distinguish the row that needs to be kept from the ones that need to be deleted.
One "trick" available in some other DBMSs is to use ROW_NUMBER() inside a common table expression (CTE), and then DELETE from that CTE. (The CTE generates a unique identifier for each row, which lets you pick out the individual rows to keep or delete.) Unfortunately, Redshift does not currently support deleting from a CTE.
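As a sketch of that trick in a DBMS that does allow it (SQL Server syntax shown here; the table `blah` and columns `a, b, c` are illustrative, not from any real schema):

```sql
-- De-duplicate, keeping one copy per (a, b, c) combination.
-- Works in SQL Server, where a CTE is deletable; NOT in Redshift.
WITH numbered AS (
    SELECT ROW_NUMBER() OVER (
               PARTITION BY a, b, c   -- columns that define a "duplicate"
               ORDER BY a             -- arbitrary tie-break within a group
           ) AS rn
    FROM blah
)
DELETE FROM numbered
WHERE rn > 1;   -- every row after the first in each group is a duplicate
```

The ROW_NUMBER() value acts as the artificial unique identifier that the otherwise-identical rows lack.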
Until that changes, a Deep Copy (copying to a separate table, using GROUP BY or DISTINCT) is your only option.
That said, the Deep Copy option may still be preferable in Redshift even if deleting from a CTE becomes possible.
EDIT:
Correction:
If any rows in a Redshift table have been deleted, any subsequent VACUUM will process the entire table (regardless of where the deleted rows are, or how many deleted rows there are).
(It is more nuanced when VACUUM follows an INSERT, but downright ugly following a DELETE.)
I also noticed that a Deep Copy uses less disk space than VACUUM. (Which only came to my attention when we ran out of disk space...)
EDIT:
Code example:
CREATE TABLE blah_temp (
  <Exactly the same DDL as the original table, especially Distribution and Sort keys>
)
;

INSERT INTO blah_temp
  SELECT DISTINCT * FROM blah
;

DROP TABLE blah;

ALTER TABLE blah_temp RENAME TO blah;
Or...
CREATE TABLE blah_temp (
  <Exactly the same DDL as the original table, especially Distribution and Sort keys>
)
;

INSERT INTO blah_temp
  SELECT * FROM blah
  GROUP BY a, b, c, d, e, f, g, etc
;

TRUNCATE TABLE blah;

INSERT INTO blah
  SELECT * FROM blah_temp
;

DROP TABLE blah_temp;
Related link: https://docs.aws.amazon.com/redshift/latest/dg/performing-a-deep-copy.html