How to uniquely identify rows in a table without a primary key

I am importing more than 600,000,000 rows from an old table that has no primary key; the table lives in a SQL Server 2005 database. I built a tool to import this data into a new database with a completely different structure. The problem is that I want to be able to resume the process from the point where it stopped, whatever the reason (an error, a network failure, and so on). Since the table has no primary key, I cannot check whether a given row has already been imported. Does anyone know how to identify each row so that I can check whether it has already been imported? The table contains duplicate rows; I already tried hashing all the columns, but that does not work precisely because of the duplicates...

Thanks!

+4
4 answers

I would bring the rows into a staging table that has an identity column, since the data is coming from another database. You can then identify the rows where all the other data is identical except for the identity value, and delete the duplicates before moving them into your production table. A sketch of this approach follows.
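A minimal sketch of the idea; all names (stage, col1, col2) are hypothetical stand-ins for the real columns:

 -- 1. Staging table with its own identity column.
 create table stage(
     stage_id int identity(1,1) primary key,
     col1 varchar(50),
     col2 varchar(100)
 )

 -- 2. (Bulk-load the source rows into stage here: BULK INSERT, bcp, SSIS, ...)

 -- 3. Number the rows inside each group of identical data and delete all but
 --    one per group, leaving a de-duplicated set keyed by stage_id.
 ;with d as (
     select row_number() over (partition by col1, col2 order by stage_id) as rn
     from stage
 )
 delete from d where rn > 1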

+4

So: you are loading a lot of rows, the rows cannot be uniquely identified, the load can (and apparently will) be interrupted at any time, and you want to be able to resume an interrupted load from where it stopped, even though for all practical purposes you cannot determine where you left off. Good.

Loading into a table that has an additional identity column will work, provided that whenever the data load starts, it always starts with the same item and loads the items in the same order. Wildly inefficient, since you have to read back through everything loaded so far every time you start it.

Another awkward option would be to first split the data you are loading into chunks of manageable size (perhaps 10,000,000 rows each). Load them chunk by chunk, keeping track of which chunks you have loaded. Use a tracking table so that you know, and can control, when a chunk has been "fully processed". If/when interrupted, you only discard the chunk that was in flight at the time of the interruption, and resume work with that chunk. A sketch of such a tracking table is below.
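A minimal sketch of the chunk-tracking idea; the table and column names are hypothetical, and the actual load step is left as comments:

 -- One row per 10,000,000-row chunk of the source.
 create table chunk_progress(
     chunk_id int primary key,                          -- chunk 1 = source rows 1..10,000,000, etc.
     status   varchar(10) not null default 'pending'    -- 'pending' or 'done'
 )

 -- On each run: pick the lowest pending chunk, wipe any partial rows it left
 -- in the destination, reload it, and only then mark it done.
 declare @current_chunk int
 select @current_chunk = min(chunk_id) from chunk_progress where status = 'pending'

 -- ... delete partial destination rows for @current_chunk, reload the chunk ...

 update chunk_progress set status = 'done' where chunk_id = @current_chunk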

+1

With duplicate rows, even row_number() will not get you anywhere, as its numbering can change between queries (because of how MSSQL stores the data). You need to either bring the data into a landing table with an identity column, or add a new identity column to the existing table ( alter table oldTbl add NewId int identity(1,1) ).

You could use row_number() and then back off the last n rows if the source holds more of them than the count in the new database, but it would be more straightforward to use the landing table. Once the identity column exists, resuming becomes a simple range query, as in the sketch below.
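A minimal sketch of resuming after NewId has been added as above; the table names (oldTbl, newTbl) and the SourceId mapping column are hypothetical:

 -- Find the highest id already copied (0 if nothing has been copied yet) ...
 declare @last int
 select @last = isnull(max(SourceId), 0) from newTbl

 -- ... and import only the rows after it, in id order.
 insert into newTbl(SourceId, col1, col2)
 select NewId, col1, col2
 from oldTbl
 where NewId > @last
 order by NewId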

0

Option 1: duplicates can be discarded

Try to find some combination of fields that comes close to identifying a row (duplicates are allowed) and pair it with a hash of the remaining fields, which you store in the destination table.

Suppose these tables:

 -- t_x is the source table; t_y is the destination, which also stores a hash
 -- of the fields that are not part of the (non-unique) id.
 create table t_x(id int, name varchar(50), description varchar(100))
 create table t_y(id int, name varchar(50), description varchar(100), hash varbinary(8000))

 -- Rows still to import: no destination row exists with the same id and the
 -- same hash of the remaining fields.
 select *
 from t_x x
 where not exists(select *
                  from t_y y
                  where x.id = y.id
                    and hashbytes('sha1', x.name + '~' + x.description) = y.hash)

The reason to combine as many fields as possible is to reduce the likelihood of hash collisions, which are a real concern in a data set of 600,000,000 records. One caveat: in T-SQL, concatenating a NULL yields NULL by default, which would null out the whole hash; a NULL-safe variant is sketched below.
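A hedged refinement of the query above, against the same hypothetical t_x/t_y tables:

 -- Wrap each field in isnull() so a NULL column does not make hashbytes()
 -- return NULL; a NULL hash never matches y.hash, so those rows would be
 -- re-selected (and re-imported) on every run.
 select *
 from t_x x
 where not exists(select *
                  from t_y y
                  where x.id = y.id
                    and hashbytes('sha1', isnull(x.name, '') + '~' + isnull(x.description, '')) = y.hash)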

Option 2: duplicates matter

If you really need the duplicate rows, you will have to add a unique id column to your big table. To achieve this, follow these steps (a sketch follows the list):

  • Alter the table and add a uniqueidentifier or int field
  • Populate the new column using the newsequentialid() or row_number() function
  • Create an index on this field
  • Add the id field to the destination table
  • After all the data has been moved, the column can be dropped
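A minimal sketch of these steps using an int id filled via row_number(); the table and column names (bigTbl, RowId, name, description) are hypothetical:

 alter table bigTbl add RowId int null

 -- row_number() gives every row a distinct number, duplicates included;
 -- updating through the CTE writes those numbers back into the new column.
 ;with numbered as (
     select RowId, row_number() over (order by name, description) as rn
     from bigTbl
 )
 update numbered set RowId = rn

 create index ix_bigTbl_RowId on bigTbl(RowId)

 -- ... run the import keyed on RowId (carry it into the destination table) ...

 -- After all the data has been moved:
 drop index ix_bigTbl_RowId on bigTbl
 alter table bigTbl drop column RowId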
0

Source: https://habr.com/ru/post/1386580/

