Delete duplicate rows in Vertica database

Vertica allows you to insert duplicates in tables. I can view them using the function "anal_constraints". How to remove duplicate rows from Vertica tables?

+4
source share
5 answers

You should try to avoid / limit the use of DELETE to a large number of records. The following approach should be more efficient:

Step 1 Create a new table with the same structure / projections as the duplicates:

create table mytable_new like mytable including projections ; 

Step 2 Insert the duplicated rows in this new table:

 insert /* +direct */ into mytable_new select <column list> from ( select * , row_number() over ( partition by <pk column list> ) as rownum from <table-name> ) a where a.rownum = 1 ; 

Step 3 rename the source table (the one that contains the duplicates):

 alter table mytable rename to mytable_orig ; 

Step 4 rename the new table:

 alter table mytable_new rename to mytable ; 

What all.

+3
source

Above my head, not a great answer, so let it be the last word, you can delete them and paste them back.

+2
source

Mauro's answer is correct, but there is an error in the sql of step 2. Thus, the full way of working, avoiding DELETE, should be as follows:

Step 1 Create a new table with the same structure / projections as the duplicates:

 create table mytable_new like mytable including projections ; 

Step 2 Insert the duplicated rows in this new table:

 insert /* +direct */ into mytable_new select <column list> from ( select * , row_number() over ( partition by <pk column list> ) as rownum from mytable ) a where a.rownum = 1 ; 

Step 3 rename the source table (the one that contains the duplicates):

 alter table mytable rename to mytable_orig ; 

Step 4 rename the new table:

 alter table mytable_new rename to mytable ; 
+2
source

You can delete duplicates on Vertica tables by creating a temporary table and generating pseudo row_ids. Here are a few steps, especially if you are removing duplicates from very large and wide tables. In the example below, I assume that lines k1 and k2 have more than 1 duplicate. For more information see here .

 -- Find the duplicates select keys, count(1) from large-table-1 where [where-conditions] group by 1 having count(1) > 1 order by count(1) desc ; -- Step 2: Dump the duplicates into temp table create table test.large-table-1-dups like large-table-1; alter table test.large-table-1-dups -- add row_num column (pseudo row_id) add column row_num int; insert into test.large-table-1-dups select *, ROW_NUMBER() OVER(PARTITION BY key) from large-table-1 where key in ('k1', 'k2'); -- where, say, k1 has n and k2 has m exact dups -- Step 3: Remove duplicates from the temp table delete from test.large-table-1-dups where row_num > 1; select * from test.dim_line_items_dups; -- Sanity test. Should have 1 row each of k1 & k2 rows above -- Step 4: Delete all duplicates from main table... delete from large-table-1 where key in ('k1', 'k2'); -- Step 5: Insert data back into main table from temp dedupe data alter table test.large-table-1-dups drop column row_num; insert into large-table-1 select * from test.large-table-1-dups; 
+1
source

You should look at this answer from the PostgreSQL wiki , which also works for Vertica:

 DELETE FROM tablename WHERE id IN( SELECT id FROM ( SELECT id, ROW_NUMBER() OVER( partition BY column1, column2, column3 ORDER BY id ) AS rnum FROM tablename ) t WHERE t.rnum > 1 ); 

Deletes all duplicate entries, but with the smallest id.

-2
source

Source: https://habr.com/ru/post/1487129/


All Articles