How to find identical files without comparing them with each other?

Question

I am creating a website where users can upload content. As always, I am aiming for world domination, so I would like to avoid storing the same file twice, for example when a user uploads the same file twice (after renaming it, or simply forgetting that she already did so in the past).

My current approach is to have the database track every uploaded file, storing the following information about each one:

file size in bytes
MD5 sum of the file contents
SHA1 sum of the file contents

A unique index then covers these three columns. I use two hashes to minimize the risk of false positives.
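
The scheme described here could be sketched roughly as follows. This is only an illustration under assumptions: Python with the standard-library hashlib and sqlite3 modules, plus a hypothetical table name (files), column names, and helper names (fingerprint, open_db, register_upload) that do not come from the question.

```python
import hashlib
import sqlite3

CHUNK = 1 << 20  # read in 1 MiB chunks so large uploads are never loaded whole


def fingerprint(path):
    """Return (size_in_bytes, md5_hex, sha1_hex) for the file at path."""
    md5, sha1, size = hashlib.md5(), hashlib.sha1(), 0
    with open(path, "rb") as fh:
        for block in iter(lambda: fh.read(CHUNK), b""):
            size += len(block)
            md5.update(block)
            sha1.update(block)
    return size, md5.hexdigest(), sha1.hexdigest()


def open_db(db_path=":memory:"):
    """Create the hypothetical files table with a unique index on all three columns."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS files ("
        " id INTEGER PRIMARY KEY,"
        " size INTEGER NOT NULL,"
        " md5 TEXT NOT NULL,"
        " sha1 TEXT NOT NULL,"
        " UNIQUE (size, md5, sha1))"
    )
    return conn


def register_upload(conn, path):
    """Return (file_id, is_new); is_new is False if the same content is already stored."""
    key = fingerprint(path)
    row = conn.execute(
        "SELECT id FROM files WHERE size = ? AND md5 = ? AND sha1 = ?", key
    ).fetchone()
    if row:
        return row[0], False
    cur = conn.execute(
        "INSERT INTO files (size, md5, sha1) VALUES (?, ?, ?)", key
    )
    conn.commit()
    return cur.lastrowid, True
```

With concurrent uploads the SELECT-then-INSERT sequence can race. The unique index would still reject the second row, so catching sqlite3.IntegrityError on the insert is probably the more robust variant.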

So my question really is: what is the likelihood of two different ("real world") files of the same size having the same MD5 and SHA1 hashes?

Or: is there a smarter method of similar (un)complexity?

(I understand that the probability may depend on the size of the file.)

Thanks!

+3

comparison file statistics unique hash-collision

MattBianco Feb 16 '11 at 13:22

3 answers

, MD5 SHA1 . , (SHA1, ) . , , - , . , - , , , "".

edit: MD5 + SHA1. , . , (SHA1, MD5) , 2 ^ -288, , . .

+2

yan Feb 16 '11 at 13:30

Brogers Rabin. , sha1 md5, , . , , , - , , . , .

#, :

http://www.developpez.net/forums/d863959/dotnet/general-dotnet/contribuez/algorithm-rabin-fingerprint/

0

George 13 . '14 10:57

Fred Foo · Accepted Answer · 2011-02-16T13:38:29+0000

The probability of two real-world files of the same size having the same SHA1 hash is zero for all practical purposes. Some weaknesses in SHA1 have been discovered, but constructing a file from a given SHA1 hash and size (1) is incredibly expensive in terms of processing power and (2) yields either garbage or the original file.
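
To put a rough number on "zero for all practical purposes", one can use the birthday bound: if a b-bit hash behaved like a uniformly random function, the chance of at least one accidental collision among n distinct files would be at most n(n-1)/2 * 2^-b. The sketch below evaluates that bound. The function name and the figure of one trillion files are illustrative assumptions, the 288-bit row idealizes the MD5 + SHA1 pair as two independent hashes, and the model covers accidental collisions only, not deliberately crafted ones.

```python
def collision_probability(n_files, hash_bits):
    """Birthday-bound estimate of P(at least one collision) among n_files
    distinct inputs, assuming the hash output is uniformly random.

    There are n*(n-1)/2 pairs, and each pair collides with probability
    2**-hash_bits; summing over the pairs gives an upper bound that is
    tight while the result is small.
    """
    pairs = n_files * (n_files - 1) // 2
    return min(1.0, pairs / 2 ** hash_bits)


if __name__ == "__main__":
    n = 10 ** 12  # hypothetical: one trillion stored files
    for name, bits in [("MD5", 128), ("SHA-1", 160), ("MD5+SHA-1", 288)]:
        p = collision_probability(n, bits)
        print(f"{name:10s} {bits:3d} bits: {p:.3e}")
```

Even for the assumed trillion files, the SHA-1 estimate stays far below the failure rates of ordinary hardware, which is the sense in which the probability is negligible.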

MD5 - . SHA-1, SHA-2.

, (, SHA1) . , , , .
