Creating a unique key based on file contents in Python

I have a lot of files that need to be uploaded to the server, and I just want to avoid duplicates.

Generating a small, unique key from a large piece of data seemed like exactly what a checksum is meant for, and hashing seemed like an evolution of that.

So I was going to use MD5 for this. But then I read somewhere that "MD5 is not meant to be used for unique keys", and that struck me as really strange.

What is the right way to do this?

Edit: By the way, I combined two sources to arrive at the following, which is how I'm doing it now, and it works fine with Python 2.5:

    import hashlib

    def md5_from_file(fileName, block_size=2**14):
        md5 = hashlib.md5()
        f = open(fileName, 'rb')  # read in binary mode so the digest is the same on every platform
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
        f.close()
        return md5.hexdigest()
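For example, this is roughly how I use it to skip duplicates before uploading (just a sketch; `upload` and the `already_uploaded` set stand in for my real upload code and storage):

    already_uploaded = set()  # in practice this would be persisted, e.g. in a database

    def upload_if_new(file_name):
        digest = md5_from_file(file_name)
        if digest in already_uploaded:
            return False              # same contents already on the server, skip it
        already_uploaded.add(digest)
        upload(file_name)             # placeholder for the real upload call
        return True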
4 answers

Sticking with MD5 is a good idea. Just to be safe, I would also add the file length or the number of chunks to your file-hash table.

Yes, it is possible that you will run into two files with the same MD5 hash, but that is quite unlikely (if your files are of decent size). Adding the number of chunks to the hash reduces that chance further, since now you would need two files of the same size with the same MD5.

    # This is the algorithm you described, but it also returns the number of chunks.
    new_file_hash, nchunks = hash_for_file(new_file)
    store_file(new_file, nchunks, new_file_hash)

    def store_file(file, nchunks, hash):
        """Tell whether another file with the same contents already exists, via a table lookup."""
        # This could be a DB lookup or some other way to obtain your hash map.
        big_table = ObtainTable()

        # A two-level lookup table might help performance;
        # that will vary with the number of entries and the nature of big_table.
        if nchunks not in big_table:
            big_table[nchunks] = {}
        if hash in big_table[nchunks]:
            raise DuplicateFileException(
                'File is a duplicate of %s' % big_table[nchunks][hash])

        big_table[nchunks][hash] = file.filename
        file.save()  # or something

To reduce the possibility further, switch to SHA-1 and use the same method. Or even use both (concatenated) if performance is not a problem.
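A rough sketch of what the concatenated variant could look like (the hashlib calls are standard; reading block by block as in the question):

    import hashlib

    def combined_digest(fileName, block_size=2**14):
        # Feed the same blocks to both hashes and concatenate the hex digests.
        md5 = hashlib.md5()
        sha1 = hashlib.sha1()
        f = open(fileName, 'rb')
        while True:
            data = f.read(block_size)
            if not data:
                break
            md5.update(data)
            sha1.update(data)
        f.close()
        return md5.hexdigest() + sha1.hexdigest()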

Of course, keep in mind that this will only catch files that are duplicates at the binary level, not images, sounds, or videos that are "the same" but have different binary representations.


The problem with hashing is that it generates a "small" identifier from a "large" dataset. It is like lossy compression: you cannot guarantee uniqueness, but you can use it to drastically limit the number of other items you need to compare against.

Note that MD5 gives a 128-bit value (I think that is what it is, although the exact number of bits is irrelevant). If your input data set has 129 bits and you actually use them all, each MD5 value will appear on average twice. For longer data sets (e.g. "all text files of exactly 1024 printable characters") you are still going to run into collisions once you get enough entries. Contrary to what another answer said, it is a mathematical certainty that you will run into collisions.

See http://en.wikipedia.org/wiki/Birthday_Paradox

Granted, you have about a 1% chance of a collision with a 128-bit hash at 2.6 * 10^18 entries, but it is better to handle the case where you do get a collision than to hope that you never will.
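You can check that figure with the usual birthday-bound approximation, p ≈ 1 - e^(-n^2 / (2 * 2^128)); a quick sketch:

    import math

    n = 2.6e18                 # number of entries
    space = 2.0 ** 128         # number of possible 128-bit hash values
    p = 1 - math.exp(-n * n / (2 * space))
    print(p)                   # roughly 0.0099, i.e. about a 1% chance of at least one collision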


The problem with MD5 is that it is broken. For most mundane applications that is fine, and people still use both MD5 and SHA-1, but I think that if you need a hash function, you need a strong hash function. To my knowledge there is still no standard replacement for either of them. There are a number of algorithms that are "supposed" to be strong, but we have the most experience with SHA-1 and MD5. That is, we (think we) know when these two break, whereas we do not really know as much about when the newer algorithms break.

Bottom line: think about the risks. If you want to go the extra mile, you can add extra checks when you find a duplicate hash, at the price of a performance penalty.
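One way to do that extra check is to fall back to a byte-for-byte comparison whenever two files share a hash; a hedged sketch, where `digest_index` is a hypothetical mapping from digest to a previously stored file path and `md5_from_file` is the helper from the question:

    import filecmp

    def find_duplicate(new_path, digest_index):
        """Return the path of an existing file with identical contents, or None."""
        digest = md5_from_file(new_path)
        existing_path = digest_index.get(digest)
        if existing_path is None:
            return None
        # Same hash: confirm with an exact comparison before treating it as a duplicate.
        if filecmp.cmp(new_path, existing_path, shallow=False):
            return existing_path
        return None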


Hint: think about how a hash table works.
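In other words, the content hash only picks the bucket; anything that lands in the same bucket still has to be compared for real equality. A tiny sketch (the `files_by_digest` dict is made up for illustration):

    files_by_digest = {}   # digest -> list of paths that hashed to it (the "bucket")

    def remember(path):
        digest = md5_from_file(path)
        files_by_digest.setdefault(digest, []).append(path)
        # Paths that share a digest are collision candidates and still need a
        # byte-for-byte comparison before being treated as true duplicates.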


Source: https://habr.com/ru/post/1308828/

