Delete duplicate rows in Vertica database

Question

Delete duplicate rows in Vertica database

Vertica allows you to insert duplicates in tables. I can view them using the function "anal_constraints". How to remove duplicate rows from Vertica tables?

+4

vertica

Amit Jun 19 '13 at 20:00

source share

5 answers

mauro · Answer 1 · 2015-12-28T11:57:39+0000

You should try to avoid / limit the use of DELETE to a large number of records. The following approach should be more efficient:

Step 1 Create a new table with the same structure / projections as the duplicates:

create table mytable_new like mytable including projections ;

Step 2 Insert the duplicated rows in this new table:

 insert /* +direct */ into mytable_new select <column list> from ( select * , row_number() over ( partition by <pk column list> ) as rownum from <table-name> ) a where a.rownum = 1 ;

Step 3 rename the source table (the one that contains the duplicates):

 alter table mytable rename to mytable_orig ;

Step 4 rename the new table:

 alter table mytable_new rename to mytable ;

What all.

geoffrobinson · Answer 2 · 2013-06-20T18:11:32+0000

Above my head, not a great answer, so let it be the last word, you can delete them and paste them back.

stevendv · Answer 3 · 2016-11-29T08:26:02+0000

Mauro's answer is correct, but there is an error in the sql of step 2. Thus, the full way of working, avoiding DELETE, should be as follows:

Step 1 Create a new table with the same structure / projections as the duplicates:

 create table mytable_new like mytable including projections ;

Step 2 Insert the duplicated rows in this new table:

 insert /* +direct */ into mytable_new select <column list> from ( select * , row_number() over ( partition by <pk column list> ) as rownum from mytable ) a where a.rownum = 1 ;

Step 3 rename the source table (the one that contains the duplicates):

 alter table mytable rename to mytable_orig ;

Step 4 rename the new table:

 alter table mytable_new rename to mytable ;

Shiva · Answer 4 · 2015-03-17T22:11:53+0000

You can delete duplicates on Vertica tables by creating a temporary table and generating pseudo row_ids. Here are a few steps, especially if you are removing duplicates from very large and wide tables. In the example below, I assume that lines k1 and k2 have more than 1 duplicate. For more information see here .

 -- Find the duplicates select keys, count(1) from large-table-1 where [where-conditions] group by 1 having count(1) > 1 order by count(1) desc ; -- Step 2: Dump the duplicates into temp table create table test.large-table-1-dups like large-table-1; alter table test.large-table-1-dups -- add row_num column (pseudo row_id) add column row_num int; insert into test.large-table-1-dups select *, ROW_NUMBER() OVER(PARTITION BY key) from large-table-1 where key in ('k1', 'k2'); -- where, say, k1 has n and k2 has m exact dups -- Step 3: Remove duplicates from the temp table delete from test.large-table-1-dups where row_num > 1; select * from test.dim_line_items_dups; -- Sanity test. Should have 1 row each of k1 & k2 rows above -- Step 4: Delete all duplicates from main table... delete from large-table-1 where key in ('k1', 'k2'); -- Step 5: Insert data back into main table from temp dedupe data alter table test.large-table-1-dups drop column row_num; insert into large-table-1 select * from test.large-table-1-dups;

Basil · Answer 5 · 2015-11-26T10:04:36+0000

You should look at this answer from the PostgreSQL wiki , which also works for Vertica:

 DELETE FROM tablename WHERE id IN( SELECT id FROM ( SELECT id, ROW_NUMBER() OVER( partition BY column1, column2, column3 ORDER BY id ) AS rnum FROM tablename ) t WHERE t.rnum > 1 );

Deletes all duplicate entries, but with the smallest id.

Delete duplicate rows in Vertica database

More articles: