How to find identical files without comparing them with each other?

I am creating a website where users can upload content. As always, I strive for world domination, so I would like not to store the same file twice. For example, a user might try to upload the same file twice (after renaming it, or simply having forgotten that she uploaded it before).

My current approach is for the database to track every uploaded file, storing the following information about each file:

  • file size in bytes
  • MD5 sum of the file contents
  • SHA1 sum of the file contents

There is then a unique index on these three columns. I use two hashes to minimize the risk of false positives.
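The scheme described above can be sketched roughly as follows. This is only an illustration, not the asker's actual code: the `files` table, its column names, the use of SQLite, and the helper names `fingerprint`, `open_db` and `register` are all assumptions made for the example.

```python
import hashlib
import sqlite3


def fingerprint(path, chunk_size=1 << 20):
    """Return (size_in_bytes, md5_hex, sha1_hex), reading the file in chunks."""
    md5 = hashlib.md5()
    sha1 = hashlib.sha1()
    size = 0
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            size += len(chunk)
            md5.update(chunk)
            sha1.update(chunk)
    return size, md5.hexdigest(), sha1.hexdigest()


def open_db(db_path=":memory:"):
    """Open the database and create the (hypothetical) files table if needed."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " md5 TEXT NOT NULL,"
        " sha1 TEXT NOT NULL,"
        " UNIQUE (size, md5, sha1))"  # the unique index on the three columns
    )
    return conn


def register(conn, path):
    """Record the file's fingerprint; return True if new, False if already known."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO files (size, md5, sha1) VALUES (?, ?, ?)",
                fingerprint(path),
            )
        return True
    except sqlite3.IntegrityError:
        # The unique index rejected the row: same size, MD5 and SHA1 already stored.
        return False
```

Because the fingerprint depends only on the contents, a renamed copy of an already stored file is rejected by the unique index, and no file is ever compared byte by byte against another.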

So my question really is: what is the likelihood of two different ("real world") files of the same size having the same MD5 and SHA1 hashes?

Or: is there a smarter method of similar (un)complexity?

(I understand that the probability may depend on the size of the file).

Thanks!

+3
3 answers

The probability of two real-world files of the same size having the same SHA1 hash is zero for all practical purposes. Some weaknesses in SHA1 have been discovered, but constructing a file to match a given SHA1 hash and size (1) is incredibly expensive in terms of processing power and (2) produces either garbage or the original file.
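To put a rough number on "zero for all practical purposes", here is a small sketch of the standard birthday (union) bound. It assumes the hash behaves like a uniformly random function and that nobody is deliberately crafting colliding files, which is a different threat model. The helper name `collision_bound` and the figure of 10**12 stored files are made up for the example.

```python
from fractions import Fraction


def collision_bound(num_files, hash_bits):
    """Upper bound on the chance that any two of num_files distinct files
    share a hash value, for an ideal hash with hash_bits bits of output."""
    pairs = num_files * (num_files - 1) // 2  # number of distinct file pairs
    return Fraction(pairs, 2 ** hash_bits)    # union bound over all pairs


if __name__ == "__main__":
    # 128 = MD5 alone, 160 = SHA1 alone, 288 = both concatenated (128 + 160).
    for bits in (128, 160, 288):
        print(bits, float(collision_bound(10 ** 12, bits)))
```

Under those assumptions, even a trillion files leave the chance of an accidental SHA1 collision far below any realistic hardware error rate.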

MD5 - . SHA-1, SHA-2.

, (, SHA1) . , , , .

+6

, MD5 SHA1 . , (SHA1, ) . , , - , . , - , , , "".

edit: MD5 + SHA1. , . , (SHA1, MD5) , 2 ^ -288, , . .

+2

Source: https://habr.com/ru/post/1792525/

