This will depend on the files you are comparing.
A) In the worst case:
- You have many files of the same size.
- The files are very large.
- The files are very similar, with the differences located in a narrow, random region of each file.
For example, if you have:
- 100 files of 2 MB each, all the same size,
- every file compared against every other,
- using direct binary comparison,
- with 50% of each file read on average (i.e. an unequal byte is found halfway through the file)
Then you will have:
- roughly 5,000 pairwise comparisons (100 × 99 / 2), i.e. about 10,000 individual file reads
- about 1 MB read from each file before the first unequal byte is found
- roughly 10 GB of disk reads in total.
However, if the same script computed file hashes first, you would:
- read 200 MB of data from disk (usually the slowest component in a computer)
- hold 1.6 KB in memory (MD5 hashes are 16 bytes each; cryptographic strength is not important here)
- and read a further 2 × N × 2 MB for the final direct binary comparison, where N is the number of duplicates found.
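To make the hash-first variant concrete, here is a minimal Python sketch under the assumptions above; the function names and the 4 KB read block are my own choices, not part of the original script:

```python
import hashlib
import os
from collections import defaultdict

def md5_of_file(path, block_size=4096):
    """Read the whole file once and return its 16-byte MD5 digest."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(block_size), b""):
            h.update(block)
    return h.digest()

def group_by_hash(paths):
    """Bucket files by (size, MD5); only buckets with 2+ files can hold duplicates."""
    buckets = defaultdict(list)
    for path in paths:
        key = (os.path.getsize(path), md5_of_file(path))
        buckets[key].append(path)
    # Candidate groups still need a final byte-for-byte comparison
    # to rule out hash collisions.
    return [group for group in buckets.values() if len(group) > 1]
```

Note that this reads every file in full (the 200 MB above) before a single duplicate is confirmed.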
I think this worst-case scenario is not typical.
B) Typical scenario:
- Files usually vary in size.
- Files are likely to differ near the beginning. This means that, for most non-identical files of the same size, direct binary comparison does not require reading the entire file.
For example, if you have:
- A folder of MP3 files (they don't get very big - say no more than 5 MB each).
- 100 files
- a size check done first
- no more than 3 files sharing any given size (duplicates or not)
- binary comparison used only for files of the same size
- differences most likely found within the first 1 KB in 99% of cases
Then you will have:
- No more than 33 groups in which 3 files share the same length
- Parallel binary reading of the three files (more are possible) at the same time, in 4 KB blocks
- If 0% turn out to be duplicates: 33 × 3 × 4 KB read = 396 KB of disk reads
- If 100% turn out to be duplicates: 33 × 3 × N, where N is the file size (~5 MB), = ~495 MB
If you expect 100% of the files to be duplicates, hashing is no more efficient than direct binary comparison. Given that you should expect fewer than 100% to be duplicates, hashing will be less efficient than direct binary comparison.
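For this typical scenario, here is a rough Python sketch of the size-check-then-binary-compare approach. It uses a pairwise comparison with early exit on the first differing 4 KB block, rather than the three-way parallel read described above, and all names are illustrative:

```python
import os
from collections import defaultdict
from itertools import combinations

BLOCK = 4096  # 4 KB blocks, as in the estimate above

def same_content(path_a, path_b):
    """Byte-for-byte comparison that stops at the first differing block."""
    if os.path.getsize(path_a) != os.path.getsize(path_b):
        return False
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        while True:
            block_a = fa.read(BLOCK)
            block_b = fb.read(BLOCK)
            if block_a != block_b:
                return False      # early exit: most non-duplicates stop here
            if not block_a:       # both files exhausted without a difference
                return True

def find_duplicates(paths):
    """Size check first, then binary comparison only within same-size groups."""
    by_size = defaultdict(list)
    for p in paths:
        by_size[os.path.getsize(p)].append(p)
    duplicates = []
    for group in by_size.values():
        for a, b in combinations(group, 2):
            if same_content(a, b):
                duplicates.append((a, b))
    return duplicates
```

Because most non-identical files of the same size differ within the first block, the typical comparison costs only a few kilobytes of reading per pair.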
C) Re-comparison
This is the exception. Building a database of hash + length + path for all files will speed up repeated comparisons, but the benefit is marginal. It requires reading 100% of every file up front and storing the hash database. A new file must still be read in full and added to the database, and if its hash matches an existing entry, a direct binary comparison is still required as the last stage (to rule out a hash collision). The database does help in one case: even if most files differ in size, a newly created file in the target folder may match an existing file's size, and the stored hash lets it be quickly excluded from direct comparison.
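If you nonetheless want a persistent index for repeat runs, one possible shape is sketched below; the file name and dictionary layout are illustrative assumptions, not part of the answer:

```python
import hashlib
import json
import os

INDEX_FILE = "file_index.json"  # illustrative name

def build_index(paths):
    """Persist size + MD5 per path so a later run can compare against the index."""
    index = {}
    for path in paths:
        with open(path, "rb") as f:
            digest = hashlib.md5(f.read()).hexdigest()
        index[path] = {"size": os.path.getsize(path), "md5": digest}
    with open(INDEX_FILE, "w") as f:
        json.dump(index, f)
    return index
```

Even with such an index, the point above stands: a hash match only narrows the candidates, and a direct binary comparison remains the final word.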
Finally:
- No additional hashing should be used (the definitive test - a binary comparison - should always be the final step).
- Binary comparison is often more efficient on the first run, when there are many files of different sizes.
- For MP3 files, a length check followed by binary comparison works well.