C#: the fastest (best?) method for identifying duplicate files in an array of directories

I want to iterate over multiple directories and find the duplicate files among n directories.

My knee-jerk idea is to have a global hash table or some other data structure holding every file found, then check each subsequent file against it to determine whether it is already in the "master" list. Obviously I don't think this would be very efficient, and "there must be a better way!" keeps ringing in my brain.

Any advice on the best way to handle this situation would be appreciated.

+4
5 answers

You can avoid hashing by comparing file sizes first. If you never find another file with the same size, you do not need to hash that file at all. You only hash a file once you find another file with the same size, and then you hash both of them.

This should be significantly faster than blindly hashing every file, although implementing the two-level check is more complicated.
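For illustration, here is a minimal sketch of that two-level check in C#. The use of MD5, the command-line directory list, and the console output are assumptions of mine for the example, not part of the answer: files are grouped by size, and only groups with more than one member get hashed.

 using System;
 using System.IO;
 using System.Linq;
 using System.Security.Cryptography;

 class SizeThenHash
 {
     static void Main(string[] args)
     {
         // args holds the directories to scan, e.g. SizeThenHash.exe C:\a C:\b
         var filesBySize = args
             .SelectMany(dir => new DirectoryInfo(dir)
                 .EnumerateFiles("*", SearchOption.AllDirectories))
             .GroupBy(f => f.Length);

         // Only files that share a size with another file are worth hashing.
         foreach (var sizeGroup in filesBySize.Where(g => g.Count() > 1))
         {
             var duplicates = sizeGroup
                 .GroupBy(f => ComputeMd5(f.FullName))
                 .Where(g => g.Count() > 1);

             foreach (var group in duplicates)
                 Console.WriteLine("Duplicates: " + string.Join(", ", group.Select(f => f.FullName)));
         }
     }

     static string ComputeMd5(string path)
     {
         using (var md5 = MD5.Create())
         using (var stream = File.OpenRead(path))
             return BitConverter.ToString(md5.ComputeHash(stream));
     }
 }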

+15

I would suggest storing multiple indexes in memory.

Create one that indexes all files by file size:

Dictionary<long, List<FileInfo>> IndexBySize;  // FileInfo.Length is a long

When you process a new file fu, you can quickly look up all the other files with the same size.

Create another one that indexes all files by the modification timestamp:

 Dictionary<DateTime, List<FileInfo>> IndexByModification; 

Given a file fu, you can find all files modified at the same time.

Repeat for each significant characteristic. You can then use the Intersect() extension method to compare multiple criteria efficiently.

For instance:

 var matchingFiles = IndexBySize[fu.Size].Intersect(IndexByModification[fu.Modified]); 

This allows you to defer the expensive scanning (hashing) until you actually need it. Then, for the files that do get hashed, create another index:

 Dictionary<string, List<FileInfo>> IndexByHash;  // key: hex-encoded MD5 hash

You might want to compute multiple hashes at the same time to reduce the number of collisions.
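Putting those pieces together, a minimal sketch of the multi-index idea might look like the following. The FileIndexes class and the Add/FindCandidates/AddToIndex helper names are illustrative choices of mine; the two dictionaries are the indexes described above.

 using System;
 using System.Collections.Generic;
 using System.IO;
 using System.Linq;

 class FileIndexes
 {
     readonly Dictionary<long, List<FileInfo>> IndexBySize = new Dictionary<long, List<FileInfo>>();
     readonly Dictionary<DateTime, List<FileInfo>> IndexByModification = new Dictionary<DateTime, List<FileInfo>>();

     public void Add(FileInfo file)
     {
         AddToIndex(IndexBySize, file.Length, file);
         AddToIndex(IndexByModification, file.LastWriteTimeUtc, file);
     }

     // Files already indexed that match fu on both size and modification time.
     // Intersect works here because the same FileInfo instances live in both indexes.
     public IEnumerable<FileInfo> FindCandidates(FileInfo fu)
     {
         if (!IndexBySize.TryGetValue(fu.Length, out var sameSize) ||
             !IndexByModification.TryGetValue(fu.LastWriteTimeUtc, out var sameTime))
             return Enumerable.Empty<FileInfo>();

         return sameSize.Intersect(sameTime);
     }

     static void AddToIndex<TKey>(Dictionary<TKey, List<FileInfo>> index, TKey key, FileInfo file)
     {
         if (!index.TryGetValue(key, out var list))
             index[key] = list = new List<FileInfo>();
         list.Add(file);
     }
 }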

+3

Your approach sounds reasonable to me. Unless you have good reason to believe it will not meet your performance requirements, I would simply implement it that way and optimize later if necessary. Remember that "premature optimization is the root of all evil."

+2

The best practice, as John Kugelman said, is to compare the sizes of two files first; if they have different sizes, they obviously are not duplicates.

If you find two files of the same size, for better performance you can compare just the first 500 KB of the two files; if the first 500 KB are identical, compare the remaining bytes. This way you don't need to read all of, say, a 500 MB file to compute a hash, so you save time and improve performance.
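A sketch of that staged comparison, with the 500 KB prefix and a 64 KB read buffer as assumed constants and AreEqual as an illustrative method name:

 using System;
 using System.IO;

 static class FileComparer
 {
     const int PrefixLength = 500 * 1024; // compare the first 500 KB before anything else
     const int BufferSize = 64 * 1024;

     public static bool AreEqual(string pathA, string pathB)
     {
         var a = new FileInfo(pathA);
         var b = new FileInfo(pathB);
         if (a.Length != b.Length)
             return false; // different sizes can never be duplicates

         long prefix = Math.Min(PrefixLength, a.Length);
         using (var streamA = a.OpenRead())
         using (var streamB = b.OpenRead())
         {
             // Stage 1: compare only the first 500 KB.
             if (!RegionsMatch(streamA, streamB, prefix))
                 return false;

             // Stage 2: the prefix matched, so compare the remaining bytes.
             return RegionsMatch(streamA, streamB, a.Length - prefix);
         }
     }

     static bool RegionsMatch(Stream left, Stream right, long bytesToCompare)
     {
         var bufferA = new byte[BufferSize];
         var bufferB = new byte[BufferSize];
         while (bytesToCompare > 0)
         {
             int want = (int)Math.Min(BufferSize, bytesToCompare);
             int readA = ReadFully(left, bufferA, want);
             int readB = ReadFully(right, bufferB, want);
             if (readA != want || readB != want)
                 return false; // a file was shorter than its reported size
             for (int i = 0; i < want; i++)
                 if (bufferA[i] != bufferB[i])
                     return false;
             bytesToCompare -= want;
         }
         return true;
     }

     static int ReadFully(Stream stream, byte[] buffer, int count)
     {
         int total = 0;
         while (total < count)
         {
             int read = stream.Read(buffer, total, count - total);
             if (read == 0) break;
             total += read;
         }
         return total;
     }
 }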

+1

For byte-by-byte comparison where you expect a lot of duplicates, the method you are already considering is most likely the one to use.

If you are really concerned about performance and know that duplicates will always have the same file name, you can start by comparing file names and only hash the bytes when you find a duplicate name. That way you save the hashing time for files that have no duplicate in the tree.
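A minimal sketch of that name-first variant, under the stated assumption that duplicates always share a file name. The NameFirstDuplicates class and ComputeMd5 helper are illustrative names of mine, and MD5 simply stands in for whatever hash you already use:

 using System;
 using System.Collections.Generic;
 using System.IO;
 using System.Linq;
 using System.Security.Cryptography;

 static class NameFirstDuplicates
 {
     public static IEnumerable<IGrouping<string, FileInfo>> Find(IEnumerable<FileInfo> files)
     {
         return files
             .GroupBy(f => f.Name, StringComparer.OrdinalIgnoreCase)  // group by name first
             .Where(g => g.Count() > 1)                               // unique names are never hashed
             .SelectMany(g => g.GroupBy(f => ComputeMd5(f.FullName))) // hash only the name collisions
             .Where(g => g.Count() > 1);
     }

     static string ComputeMd5(string path)
     {
         using (var md5 = MD5.Create())
         using (var stream = File.OpenRead(path))
             return BitConverter.ToString(md5.ComputeHash(stream));
     }
 }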

0

Source: https://habr.com/ru/post/1309488/

