Remove duplicates from a large dataset (> 100 million rows)

I know this topic has come up many times before, but none of the proposed solutions worked for my data set: each time my laptop either ran out of memory or ran out of disk space.

My table looks like this and has 108 million rows:

 Col1        | Col2 | Col3            | Col4 | SICComb  | NameComb
 Case New    | 3523 | Alexander       | 6799 | 67993523 | AlexanderCase New
 Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
 Undisclosed | 6799 | Case New        | 3523 | 67993523 | Case NewUndisclosed
 Case New    | 3523 | Undisclosed     | 6799 | 67993523 | Case NewUndisclosed
 SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard
 SmartCard   | 3674 | Virtual NetComm | 7373 | 73733674 | SmartCardVirtual NetComm
 SmartCard   | 3674 | NEC             | 7373 | 73733674 | NECSmartCard

Rows should be unique on the combination of SICComb and NameComb. I tried adding a primary key with:

 ALTER TABLE dbo.test ADD ID INT IDENTITY(1,1) 

but adding the integer column filled more than 30 GB of my storage within a few minutes.

What is the fastest and most efficient way to remove duplicates from such a table?

+4
2 answers

In general, the fastest way to remove duplicates from a table is to insert the records, without duplicates, into a temporary table, truncate the original table, and insert them back.

Here is an idea using SQL Server syntax:

 select distinct t.*
 into #temptable
 from t;
 truncate table t;
 insert into t
 select tt.*
 from #temptable tt;

Of course, this largely depends on how fast the first step is. You also need enough space to store two copies of the table.

Note that the syntax for creating a temporary table varies between databases. Some use create table as rather than select into.
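
For databases that use create table as, the same three steps might look roughly like the sketch below. It is written in PostgreSQL-style syntax as an assumption (it will not run on SQL Server), and temptable is just an illustrative name.

```sql
-- PostgreSQL-style sketch (assumption: not SQL Server syntax).
-- Step 1: copy the distinct rows into a temporary table.
create temporary table temptable as
select distinct t.*
from t;

-- Step 2: empty the original table.
truncate table t;

-- Step 3: copy the de-duplicated rows back.
insert into t
select tt.*
from temptable tt;
```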

EDIT:

The identity insert error is troublesome. I think you need to remove the identity column from the column list for the distinct. Or do:

 select min(<identity col>), <all other columns>
 from t
 group by <all other columns>;

If you have an identity column, then there are no duplicates (by definition), because every row has a different identity value.

In the end, you will need to decide which id you want for the rows. If you can generate a new id for the rows, just leave the identity column out of the column list for the insert:

 insert into t(<all other columns>)
 select <all other columns>
 from #temptable;

If you need the old identity value (and the minimum will do), set IDENTITY_INSERT on for the table so that explicit identity values are accepted, and do the following:

 insert into t(<all columns including identity>)
 select <all columns including identity>
 from #temptable;
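
Put together for the table in the question, the identity-preserving variant could look something like this. It is a sketch only: it assumes the table is dbo.test with the ID identity column already added and the six columns shown in the question, and it removes only rows that are identical in all six columns.

```sql
-- Keep one row per group of fully identical rows, remembering the smallest ID.
select min(ID) as ID, Col1, Col2, Col3, Col4, SICComb, NameComb
into #temptable
from dbo.test
group by Col1, Col2, Col3, Col4, SICComb, NameComb;

-- Empty the original table.
truncate table dbo.test;

-- Allow explicit values for the identity column while copying back.
set identity_insert dbo.test on;

insert into dbo.test (ID, Col1, Col2, Col3, Col4, SICComb, NameComb)
select ID, Col1, Col2, Col3, Col4, SICComb, NameComb
from #temptable;

set identity_insert dbo.test off;

drop table #temptable;
```

To treat rows as duplicates when only SICComb and NameComb match, the group by would have to be restricted to those two columns, with an aggregate such as min() chosen for each of the others.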
+2

If you are using SQL Server, you can delete through a common table expression:

 with cte as (
     select row_number() over (partition by SICComb, NameComb order by Col1) as row_num
     from Table1
 )
 delete from cte
 where row_num > 1;

Here every row is numbered, with a separate sequence for each unique combination of SICComb + NameComb. You can choose which rows are deleted through the order by inside the over clause: the first row in that order gets row_num = 1 and is kept.
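
On a table of this size, a single delete may produce a very large transaction, which matters given the storage problems described in the question. One possible workaround, sketched here as an assumption rather than a tested recipe, is to delete in batches with delete top (...) and loop until nothing is left. The variable names and the batch size of 1,000,000 are arbitrary choices for illustration.

```sql
-- Hypothetical batch size; tune it to the available log and disk space.
declare @batch int = 1000000;
declare @deleted int = 1;

while @deleted > 0
begin
    with cte as (
        select row_number() over (partition by SICComb, NameComb order by Col1) as row_num
        from Table1
    )
    -- Remove at most @batch duplicate rows per pass.
    delete top (@batch) from cte
    where row_num > 1;

    -- Stop once a pass deletes nothing.
    set @deleted = @@rowcount;
end
```

Each pass recomputes row_number() over the whole table, so this trades total run time for smaller transactions. An index on (SICComb, NameComb, Col1) would probably help, at the cost of more storage.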

+6

Source: https://habr.com/ru/post/1498719/

