Duplicate File Detection

I want to detect duplicate files in a directory tree. When two identical files are found, only one copy should be kept; the remaining duplicates will be deleted to save disk space.

Duplicate means that the files have the same content; their file names and paths may differ.

I thought about using hash algorithms for this, but different files can produce the same hash, so I need an additional mechanism that tells me whether two files really are identical even when their hashes match, because I do not want to delete two different files.

What additional fast and reliable mechanism would you use?

+4
7 answers

Hashing every file will make your program slow. It is better to check the file size first: all duplicates of a file must have the same size. Only when two files have the same size do you need a hash check. This will make your program fast.

The check can be built up in several steps:

  • Check whether the file sizes are equal.
  • If step 1 passes, check whether the first and last byte ranges (say, 100 bytes) are equal.
  • If step 2 passes, check the file type.
  • If step 3 passes, check the hash.

The more cheap criteria you add, the faster it will run, because most non-duplicates are rejected early and you can often avoid the last step (hashing) entirely. A sketch of this staged approach is shown below.
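
Here is a minimal C# sketch of that staged check (my own illustration, not code from the answer; the 100-byte edge window and the use of SHA-256 for the final hash are assumptions):

using System;
using System.IO;
using System.Linq;
using System.Security.Cryptography;

static class StagedDuplicateCheck
{
    // Runs the cheap checks first and only hashes when everything else matches.
    public static bool LooksLikeDuplicate(string pathA, string pathB)
    {
        var a = new FileInfo(pathA);
        var b = new FileInfo(pathB);

        // Step 1: sizes must match.
        if (a.Length != b.Length) return false;

        // Step 2: the first and last 100 bytes must match.
        if (!EdgeBytesEqual(pathA, pathB, 100)) return false;

        // Step 3 (file type) is omitted here; an extension or magic-number check could go in between.

        // Step 4: full-content hash as the expensive final filter.
        return HashOf(pathA).SequenceEqual(HashOf(pathB));
    }

    static bool EdgeBytesEqual(string pathA, string pathB, int count)
    {
        using (var fa = File.OpenRead(pathA))
        using (var fb = File.OpenRead(pathB))
        {
            int n = (int)Math.Min(count, fa.Length);   // both files have the same length here
            if (!ReadAt(fa, 0, n).SequenceEqual(ReadAt(fb, 0, n))) return false;

            long tailOffset = Math.Max(0, fa.Length - n);
            return ReadAt(fa, tailOffset, n).SequenceEqual(ReadAt(fb, tailOffset, n));
        }
    }

    // Reads exactly 'count' bytes at 'offset' (fewer only if the file ends first).
    static byte[] ReadAt(FileStream fs, long offset, int count)
    {
        fs.Seek(offset, SeekOrigin.Begin);
        var buffer = new byte[count];
        int total = 0;
        while (total < count)
        {
            int read = fs.Read(buffer, total, count - total);
            if (read == 0) break;
            total += read;
        }
        return buffer;
    }

    static byte[] HashOf(string path)
    {
        using (var sha = SHA256.Create())
        using (var fs = File.OpenRead(path))
            return sha.ComputeHash(fs);
    }
}

A final byte-for-byte comparison could still be added after step 4 if, like the asker, you do not want to rely on the hash alone.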

+12

This will depend on the files you are comparing.

A) In the worst case:

  • You have many files with the same size.
  • Files are very large
  • The files are very similar, with the differences located in small random regions of each file.

For example, if you have:

  • 100 files of 2 MB, all the same size,
  • compared against each other,
  • using direct binary comparison,
  • reading on average 50% of each file (an unequal byte is as likely to turn up in the first half as in the second),

Then you will have:

  • roughly 10,000 comparisons,
  • with about 1 MB read per comparison,
  • for a total of roughly 10 GB of disk reads.

However, if in the same scenario you took the file hashes first, you would:

  • read 200 MB of data from disk (usually the slowest component in a computer) to hash everything,
  • keep only 1.6 KB of hashes in memory (using MD5 hashing - 16 bytes per file - cryptographic strength is not important here),
  • and read 2N × 2 MB for the final direct binary comparison, where N is the number of duplicates found.

I think this worst-case scenario is not typical, though.

B) Typical scenario:

  • Files usually vary in size.
  • Files are likely to differ near the beginning. This means that a direct binary comparison usually does not have to read the whole file to tell apart two different files of the same size.

For example, if you have:

  • a folder of MP3 files (they don't get very big - say no more than 5 MB each),
  • 100 files,
  • a size check done first,
  • no more than 3 files sharing any given size (duplicates or not),
  • binary comparison used only for files of the same size,
  • and 99% of mismatches detected within the first 1 KB,

Then you will have:

  • at most 33 sets of files where the length is the same (up to 3 files per set),
  • parallel binary reads of those 3 files (more are possible) in 4 KB blocks,
  • with 0% duplicates found: 33 × 3 × 4 KB read ≈ 396 KB of disk reads,
  • with 100% duplicates found: 33 × 3 × N, where N is the file size (~5 MB), ≈ 495 MB of disk reads.

Even if you expect 100% duplicates, hashing is no more efficient than direct binary comparison; and since you should expect fewer than 100% duplicates, hashing will be less efficient than direct binary comparison.

C) Re-comparison

This is the exception. Building a database of hash + length + path for all files will speed up repeated comparisons, but the benefit is marginal. It requires reading 100% of every file up front and storing the hash database. A new file must still be read in full, hashed, and added to the database, and if its hash matches an existing entry, a direct binary comparison is still required as the final step (to rule out a hash collision). Even if most files are of different sizes, a newly created file in the target folder may match an existing file's size, and the stored hashes then let it be excluded from a direct comparison quickly.
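
A minimal sketch of such a hash + length + path index in C# (purely illustrative; the in-memory dictionary and the choice of MD5 are my assumptions - a real tool would persist the index between runs):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// Maps a "length:hash" key to every path already seen with that key.
// Paths returned by CandidatesFor still need a final byte-for-byte comparison.
class HashIndex
{
    private readonly Dictionary<string, List<string>> index = new Dictionary<string, List<string>>();

    public IReadOnlyList<string> CandidatesFor(string path)
    {
        string key = KeyFor(path);
        if (!index.TryGetValue(key, out var known))
        {
            known = new List<string>();
            index[key] = known;
        }

        var candidates = new List<string>(known);  // existing files with the same length and hash
        known.Add(path);                           // remember this file for later lookups
        return candidates;
    }

    private static string KeyFor(string path)
    {
        using (var md5 = MD5.Create())
        using (var fs = File.OpenRead(path))
            return new FileInfo(path).Length + ":" + BitConverter.ToString(md5.ComputeHash(fs));
    }
}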

Finally:

  • No additional hashing should be used (the final test - a binary comparison - should always be the final test).
  • Direct binary comparison is often more efficient on a first run, when there are many files of different sizes.
  • MP3-style collections work well with the length check followed by binary comparison.
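
A sketch of the approach this answer favours - group by size, then binary-compare candidates in 4 KB blocks and stop at the first difference (illustrative only; the type and method names are invented):

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

static class SizeThenBinaryCompare
{
    // Groups files by length, then binary-compares only files inside the same group.
    public static IEnumerable<(string First, string Second)> FindDuplicatePairs(IEnumerable<string> paths)
    {
        foreach (var group in paths.GroupBy(p => new FileInfo(p).Length))
        {
            var sameSize = group.ToList();
            for (int i = 0; i < sameSize.Count; i++)
                for (int j = i + 1; j < sameSize.Count; j++)
                    if (ContentsEqual(sameSize[i], sameSize[j]))
                        yield return (sameSize[i], sameSize[j]);
        }
    }

    // Reads both files in 4 KB blocks; most equal-size non-duplicates are rejected
    // after the first few kilobytes, which is the point of the answer above.
    static bool ContentsEqual(string pathA, string pathB)
    {
        const int BlockSize = 4096;
        var bufA = new byte[BlockSize];
        var bufB = new byte[BlockSize];

        using (var fa = File.OpenRead(pathA))
        using (var fb = File.OpenRead(pathB))
        {
            while (true)
            {
                int readA = FillBlock(fa, bufA);
                int readB = FillBlock(fb, bufB);
                if (readA != readB) return false;   // lengths differ after all
                if (readA == 0) return true;        // both files ended: identical
                if (!bufA.AsSpan(0, readA).SequenceEqual(bufB.AsSpan(0, readB)))
                    return false;                   // first differing block decides
            }
        }
    }

    static int FillBlock(FileStream fs, byte[] buffer)
    {
        int total = 0;
        while (total < buffer.Length)
        {
            int read = fs.Read(buffer, total, buffer.Length - total);
            if (read == 0) break;
            total += read;
        }
        return total;
    }
}
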
+2

The hash-set solution is fine - you just need one of the usual collision-resolution mechanisms for two elements that hash to the same value: chaining or open addressing.

Just add the elements one by one - if the implementation detects a duplicate, it will not add it to the hash set. You know an element is a duplicate if the set did not grow after you tried to add it.

Most likely your language already ships an implementation of this data structure, for example HashSet in Java and unordered_set in C++.
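
In C#, for example, HashSet<T>.Add already reports this directly by returning false for an element that is already present. A small sketch (keying the set on a content hash string is my assumption, not something the answer specifies):

using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

static class HashSetDuplicates
{
    public static List<string> FindDuplicates(IEnumerable<string> paths)
    {
        var seen = new HashSet<string>();       // content hashes seen so far
        var duplicates = new List<string>();

        foreach (var path in paths)
        {
            // Add returns false when the key is already in the set,
            // i.e. when a file with the same content hash was seen before.
            if (!seen.Add(HashOf(path)))
                duplicates.Add(path);
        }
        return duplicates;
    }

    static string HashOf(string path)
    {
        using (var sha = SHA256.Create())
        using (var fs = File.OpenRead(path))
            return BitConverter.ToString(sha.ComputeHash(fs));
    }
}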

+1

If you use SHA-1, or better still SHA-256 or higher, I really doubt you will ever get the same hash value for two different files. SHA is a family of cryptographic hash functions and is used in version control systems such as Git, so you can be fairly confident it will do the job for you.

But if you still want additional checks, you can follow these two steps:
1) Parse the headers - a really tough cookie to crack, since different formats can have different header lengths.
2) Add some sanity checks - compare file sizes, read a few random positions in both files and check that the bytes there are the same.
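
A short C# sketch of both ideas - a SHA-256 digest plus a random-offset spot check (the number of samples and the fixed seed are arbitrary assumptions of mine):

using System;
using System.IO;
using System.Security.Cryptography;

static class ShaAndSpotCheck
{
    // Full-file SHA-256 digest as a hex string.
    public static string Sha256Of(string path)
    {
        using (var sha = SHA256.Create())
        using (var fs = File.OpenRead(path))
            return BitConverter.ToString(sha.ComputeHash(fs)).Replace("-", "");
    }

    // Sanity check: equal size plus identical bytes at a handful of random positions.
    public static bool SpotCheck(string pathA, string pathB, int samples = 8)
    {
        var a = new FileInfo(pathA);
        var b = new FileInfo(pathB);
        if (a.Length != b.Length) return false;
        if (a.Length == 0) return true;

        var rng = new Random(12345);   // fixed seed so the check is repeatable
        using (var fa = File.OpenRead(pathA))
        using (var fb = File.OpenRead(pathB))
        {
            for (int i = 0; i < samples; i++)
            {
                long offset = (long)(rng.NextDouble() * (a.Length - 1));
                fa.Seek(offset, SeekOrigin.Begin);
                fb.Seek(offset, SeekOrigin.Begin);
                if (fa.ReadByte() != fb.ReadByte()) return false;
            }
        }
        return true;
    }
}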

+1

This is typical md5sum output:

0c9990e3d02f33d1ea2d63afb3f17c71 

If you don't have to fear intentionally forged files, the likelihood that a second, random file produces the same digest is

 1/(decimal(0xffffffffffffffffffffffffffffffff)+1) 

which is one chance in 2^128. If you also take the file size into account as an additional test, your confidence that the two files are identical increases further. You can add more and more dimensions, but a bitwise comparison is always the last word in such discussions. For practical purposes, md5sum should be enough.
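
If you would rather compute that digest in code than shell out to md5sum, a minimal C# sketch (class and method names are mine):

using System;
using System.IO;
using System.Security.Cryptography;

static class Md5Digest
{
    // Produces a lowercase 32-character hex string, like the md5sum output above.
    public static string Of(string path)
    {
        using (var md5 = MD5.Create())
        using (var fs = File.OpenRead(path))
            return BitConverter.ToString(md5.ComputeHash(fs)).Replace("-", "").ToLowerInvariant();
    }
}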

+1
/// ------------------------------------------------------------------------
/// <summary>
/// Writes duplicate files to a List<String>
/// </summary>
private void CompareDirectory(string[] files)
{
    for (int i = 0; i < files.Length; i++)
    {
        FileInfo one = new FileInfo(files[i]);
        // Here a spot for a progress bar or something

        for (int i2 = 0; i2 < files.Length; i2++)
        {
            if (i != i2 && !duplicatePathsOne.Contains(files[i2])) // In order to prevent duplicate entries
            {
                FileInfo two = new FileInfo(files[i2]);

                if (FilesAreEqual_OneByte(one, two))
                {
                    duplicatePathsOne.Add(files[i]);
                    duplicateNamesOne.Add(Path.GetFileName(files[i]));
                    duplicatePathsTwo.Add(files[i2]);
                    duplicateNamesTwo.Add(Path.GetFileName(files[i2]));
                }
            }
        }
    }
}

/// ------------------------------------------------------------------------
/// <summary>
/// Compares two files byte by byte
/// </summary>
private static bool FilesAreEqual_OneByte(FileInfo first, FileInfo second)
{
    // Files of different length can never be duplicates.
    if (first.Length != second.Length)
        return false;

    using (FileStream fs1 = first.OpenRead())
    using (FileStream fs2 = second.OpenRead())
    {
        // 'long' counter so files larger than 2 GB do not overflow it.
        for (long i = 0; i < first.Length; i++)
        {
            if (fs1.ReadByte() != fs2.ReadByte())
                return false;
        }
    }

    return true;
}
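
For context, this is how the method might be driven; the four list fields and the helper below are my own assumptions implied by the snippet above, not part of the original answer:

// Fields the method above relies on (not shown in the original snippet).
private readonly List<string> duplicatePathsOne = new List<string>();
private readonly List<string> duplicateNamesOne = new List<string>();
private readonly List<string> duplicatePathsTwo = new List<string>();
private readonly List<string> duplicateNamesTwo = new List<string>();

private void FindDuplicatesUnder(string root)
{
    // Recursively collect every file path, then run the pairwise comparison.
    string[] files = Directory.GetFiles(root, "*", SearchOption.AllDirectories);
    CompareDirectory(files);
}
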
0

Here is a VBS script that generates a CSV file listing the duplicate files in a folder, based on the MD5 checksum and size of each file.

Set fso = CreateObject("Scripting.FileSystemObject")
Dim dic: Set dic = CreateObject("Scripting.Dictionary")
Dim oMD5: Set oMD5 = CreateObject("System.Security.Cryptography.MD5CryptoServiceProvider")
Dim oLog 'As Scripting.TextStream

Set oArgs = WScript.Arguments

If oArgs.Count = 1 Then
    sFolderPath = GetFolderPath()
    Set oLog = fso.CreateTextFile(sFolderPath & "\DuplicateFiles.csv", True)
    oLog.Write "sep=" & vbTab & vbCrLf
    CheckFolder oArgs(0)   ' the folder that was dropped on / passed to the script
    oLog.Close
    MsgBox "Done!"
Else
    MsgBox "Drop Folder"
End If

Sub CheckFolder(sFolderPath)
    Dim sKey
    Dim oFolder 'As Scripting.Folder
    Set oFolder = fso.GetFolder(sFolderPath)

    For Each oFile In oFolder.Files
        ' Key on MD5 checksum plus size; the first file seen wins, later matches are logged.
        sKey = GetMd5(oFile.Path) & " - " & oFile.Size
        If dic.Exists(sKey) = False Then
            dic.Add sKey, oFile.Path
        Else
            oLog.Write oFile.Path & vbTab & dic(sKey) & vbCrLf
        End If
    Next

    For Each oChildFolder In oFolder.SubFolders
        CheckFolder oChildFolder.Path
    Next
End Sub

Function GetFolderPath()
    Dim oFile 'As Scripting.File
    Set oFile = fso.GetFile(WScript.ScriptFullName)
    GetFolderPath = oFile.ParentFolder
End Function

Function GetMd5(filename)
    Dim oXml, oElement
    oMD5.ComputeHash_2(GetBinaryFile(filename))
    Set oXml = CreateObject("MSXML2.DOMDocument")
    Set oElement = oXml.CreateElement("tmp")
    oElement.DataType = "bin.hex"
    oElement.NodeTypedValue = oMD5.Hash
    GetMd5 = oElement.Text
End Function

Function GetBinaryFile(filename)
    Dim oStream: Set oStream = CreateObject("ADODB.Stream")
    oStream.Type = 1 'adTypeBinary
    oStream.Open
    oStream.LoadFromFile filename
    GetBinaryFile = oStream.Read
    oStream.Close
    Set oStream = Nothing
End Function

And here is a VBS script that generates a CSV file listing the duplicate files in a folder based only on file name and size.

Set fso = CreateObject("Scripting.FileSystemObject")
Dim dic: Set dic = CreateObject("Scripting.Dictionary")
Dim oLog 'As Scripting.TextStream

Set oArgs = WScript.Arguments

If oArgs.Count = 1 Then
    sFolderPath = GetFolderPath()
    Set oLog = fso.CreateTextFile(sFolderPath & "\DuplicateFiles.csv", True)
    oLog.Write "sep=" & vbTab & vbCrLf
    CheckFolder oArgs(0)   ' the folder that was dropped on / passed to the script
    oLog.Close
    MsgBox "Done!"
Else
    MsgBox "Drop Folder"
End If

Sub CheckFolder(sFolderPath)
    Dim sKey
    Dim oFolder 'As Scripting.Folder
    Set oFolder = fso.GetFolder(sFolderPath)

    For Each oFile In oFolder.Files
        ' Key on name plus size only; no file content is read.
        sKey = oFile.Name & " - " & oFile.Size
        If dic.Exists(sKey) = False Then
            dic.Add sKey, oFile.Path
        Else
            oLog.Write oFile.Path & vbTab & dic(sKey) & vbCrLf
        End If
    Next

    For Each oChildFolder In oFolder.SubFolders
        CheckFolder oChildFolder.Path
    Next
End Sub

Function GetFolderPath()
    Dim oFile 'As Scripting.File
    Set oFile = fso.GetFile(WScript.ScriptFullName)
    GetFolderPath = oFile.ParentFolder
End Function
0

Source: https://habr.com/ru/post/1402690/

