Find duplicate binary records in SQL Server 2008 (image data type)

I inherited a database with a 300-gigabyte table full of SQL Server image data. I understand that the image data type is deprecated.

As a regular cleanup, I want to remove all duplicate images from the table where certain conditions are met.

How can I efficiently compare binary data using SQL? Is the equality operator = sufficient?

Here is the scenario:

 Table 'Paperwork'
   int   ID
   int   EmployeeID
   int   AnotherID
   int   AnotherFKID
   image Attachment

I want to find all rows where the values of Attachment, EmployeeID, AnotherID and AnotherFKID match. This must be done with minimal impact on the database, as there are more than 1,116,313 rows.

Edit

The SQL Server Image data type does not support LIKE or the usual comparison operators.

Edit

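Thanks to @Martin, who suggested casting the image column to varbinary. I built on this to get an MD5 checksum using HASHBYTES. The inner cast needs an explicit length: a bare varbinary in CAST defaults to 30 bytes, and HASHBYTES on SQL Server 2008 accepts at most 8000 bytes of input.

HASHBYTES('MD5', CAST(CAST([Attachment] AS varbinary(max)) AS varbinary(8000))) AS AttachmentMD5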

1 answer

Jeremiah,

Any all-in-one script will flush the buffer cache as it reads through 300 GB. Split the work into several tasks.

Task 1

  • Create a table with an ID and a group ID that identifies rows whose three int columns hold the same values.

Table example

  TableID  PaperWorkID  GroupID
  1        14           1
  2        15           1
  3        21           2
  4        55           2

Now we know that PaperWorkID 14 and 15 share the same values in the three int columns, because they are in the same group.
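Task 1 could be sketched roughly as follows. The work table name dbo.PaperworkDupes is hypothetical, and the column names follow the table example above. Only rows whose three int columns occur more than once are stored, which is an assumption on my part to keep the work table small.

```sql
-- Hypothetical work table; the name PaperworkDupes is an assumption.
CREATE TABLE dbo.PaperworkDupes (
    TableID     int IDENTITY(1,1) PRIMARY KEY,
    PaperWorkID int NOT NULL,
    GroupID     int NOT NULL
);

-- Only the int columns are referenced here; the Attachment column is not.
INSERT INTO dbo.PaperworkDupes (PaperWorkID, GroupID)
SELECT g.ID, g.GroupID
FROM (
    SELECT ID,
           -- Same (EmployeeID, AnotherID, AnotherFKID) => same GroupID.
           DENSE_RANK() OVER (ORDER BY EmployeeID, AnotherID, AnotherFKID) AS GroupID,
           -- Number of rows sharing those three values.
           COUNT(*) OVER (PARTITION BY EmployeeID, AnotherID, AnotherFKID) AS GroupSize
    FROM dbo.Paperwork
) AS g
WHERE g.GroupSize > 1;   -- keep only rows that have at least one sibling
```

Rows whose three int values are unique never enter the work table, so the later tasks touch fewer attachments.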

Task 2

  • Add a bigint column to the table and fill it with the DATALENGTH of the image column in the Paperwork table, looked up by PaperWorkID.
  • Remove all rows that have no duplicate on DATALENGTH and GroupID.
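A minimal sketch of Task 2, assuming a work table called dbo.PaperworkDupes (hypothetical name) with the PaperWorkID and GroupID columns from the table example. The column name AttachmentLength is also made up for illustration. GO is the batch separator used by SSMS and sqlcmd; it is needed because the new column cannot be referenced in the same batch that adds it.

```sql
ALTER TABLE dbo.PaperworkDupes ADD AttachmentLength bigint NULL;
GO

-- Record the size in bytes of each candidate attachment.
UPDATE d
SET    AttachmentLength = DATALENGTH(p.Attachment)
FROM   dbo.PaperworkDupes AS d
JOIN   dbo.Paperwork      AS p ON p.ID = d.PaperWorkID;

-- Drop every candidate that has no other row with the same group and length.
-- Rows with a NULL attachment have a NULL length and are dropped here too.
DELETE d
FROM   dbo.PaperworkDupes AS d
WHERE  NOT EXISTS (
    SELECT 1
    FROM   dbo.PaperworkDupes AS o
    WHERE  o.GroupID          = d.GroupID
      AND  o.AttachmentLength = d.AttachmentLength
      AND  o.PaperWorkID     <> d.PaperWorkID
);
```

Two attachments of different lengths cannot be identical, so this is a cheap filter to run before any hashing.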

Task 3

  • Add a varbinary(max) column to the table.
  • Fill the column with an MD5 hash of the image column, looked up by PaperWorkID.
  • Remove all rows that have no duplicate on the MD5 hash and GroupID.
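Task 3 might look something like this, again assuming a hypothetical work table dbo.PaperworkDupes with PaperWorkID and GroupID columns. Three things differ from the bullets and are my own choices. First, an MD5 hash is 16 bytes, so the sketch uses varbinary(16) instead of varbinary(max). Second, HASHBYTES on SQL Server 2008 accepts at most 8000 bytes, so only the first 8000 bytes of each attachment are hashed, which makes the hash a filter rather than proof of equality. Third, the update runs in batches of 5000 rows (an arbitrary figure) to limit the load.

```sql
ALTER TABLE dbo.PaperworkDupes ADD AttachmentMD5 varbinary(16) NULL;
GO

-- Hash in small batches so each statement reads a limited amount of image data.
WHILE 1 = 1
BEGIN
    UPDATE TOP (5000) d
    -- SUBSTRING on an image column returns varbinary; take the first 8000 bytes.
    SET    AttachmentMD5 = HASHBYTES('MD5', SUBSTRING(p.Attachment, 1, 8000))
    FROM   dbo.PaperworkDupes AS d
    JOIN   dbo.Paperwork      AS p ON p.ID = d.PaperWorkID
    WHERE  d.AttachmentMD5 IS NULL
      AND  p.Attachment IS NOT NULL;   -- NULL attachments would never get a hash

    IF @@ROWCOUNT = 0 BREAK;
END;

-- Drop every candidate that has no other row with the same group and hash.
-- Rows left with a NULL hash never match anything and are dropped as well.
DELETE d
FROM   dbo.PaperworkDupes AS d
WHERE  NOT EXISTS (
    SELECT 1
    FROM   dbo.PaperworkDupes AS o
    WHERE  o.GroupID       = d.GroupID
      AND  o.AttachmentMD5 = d.AttachmentMD5
      AND  o.PaperWorkID  <> d.PaperWorkID
);
```

Because only a prefix is hashed, rows that survive this step are candidates only. A full byte comparison is still needed before anything is deleted from Paperwork.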

Task 4

  • Make two backup copies of the Paperwork table.
  • Remove the duplicate rows from Paperwork based on the rows remaining in the table.
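The final delete could be sketched like this. It assumes a hypothetical work table dbo.PaperworkDupes with PaperWorkID, GroupID and an AttachmentMD5 hash column (all names are illustrative), and it assumes you want to keep the row with the lowest ID in each set of duplicates. Since a hash match alone does not prove two attachments are identical, the sketch also compares the full bytes after casting to varbinary(max), which does support the = operator.

```sql
-- Delete a Paperwork row only if a lower-ID row in the same group
-- has the same hash AND byte-for-byte identical attachment data.
DELETE dup
FROM   dbo.Paperwork      AS dup
JOIN   dbo.PaperworkDupes AS d    ON d.PaperWorkID = dup.ID
JOIN   dbo.PaperworkDupes AS k    ON k.GroupID       = d.GroupID
                                 AND k.AttachmentMD5 = d.AttachmentMD5
                                 AND k.PaperWorkID   < d.PaperWorkID
JOIN   dbo.Paperwork      AS keep ON keep.ID = k.PaperWorkID
WHERE  CAST(dup.Attachment  AS varbinary(max))
     = CAST(keep.Attachment AS varbinary(max));
```

It is probably worth running the same joins as a SELECT first to review what would be removed, and checking for foreign keys that reference Paperwork.ID.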

If the image data came from scanned paper, there is very little chance that two scans produced identical bytes. If the same file was uploaded twice, you're in luck.


Source: https://habr.com/ru/post/888480/

