Why does Python's hashlib.sha1 compute a different value from "git hash-object" for the same file?

I am trying to calculate the SHA-1 value of a file.

I created this script:

    import hashlib

    def hashfile(filepath):
        sha1 = hashlib.sha1()
        f = open(filepath, 'rb')
        try:
            sha1.update(f.read())
        finally:
            f.close()
        return sha1.hexdigest()

For a specific file, I get this hash value:
8c3e109ff260f7b11087974ef7bcdbdc69a0a3b9
But when I calculate the value using git hash-object, I get this value: d339346ca154f6ed9e92205c3c5c38112e761eb7

How do they differ? Am I doing something wrong, or can I just ignore the difference?

+45
git python hash
Dec 08 '09 at 21:13
2 answers

git computes the hashes as follows:

 sha1("blob " + filesize + "\0" + data) 

Link
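
For illustration, here is a minimal sketch of that formula in Python (the function name git_blob_hash and the read-the-whole-file approach are just for the example):

    import hashlib

    def git_blob_hash(filepath):
        # Reproduce "git hash-object": SHA-1 over the blob header
        # ("blob <size>\0") followed by the raw file contents.
        with open(filepath, 'rb') as f:
            data = f.read()
        header = b'blob ' + str(len(data)).encode('ascii') + b'\0'
        return hashlib.sha1(header + data).hexdigest()

This should match the output of git hash-object for an unmodified file on disk.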

+51
Dec 08 '09 at 21:17

For reference, here is a shorter version:

    def sha1OfFile(filepath):
        import hashlib
        with open(filepath, 'rb') as f:
            return hashlib.sha1(f.read()).hexdigest()

Second: although I have never seen it happen, I worry about the potential for f.read() to return less than the full file, or, for a multi-gigabyte file, for f.read() to run out of memory. For the sake of instruction, let's consider how to fix that. A first fix:

    def sha1OfFile(filepath):
        import hashlib
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            for line in f:
                sha.update(line)
        return sha.hexdigest()

However, there is no guarantee that '\n' appears in the file at all, so the for loop, which yields chunks of the file ending in '\n', can run into the same problem we had originally. Unfortunately, I don't see a similarly Pythonic way to iterate over blocks of the file that are as large as we can handle, which, I think, means we are stuck with a while True: ... break loop and a magic number for the block size:

    def sha1OfFile(filepath):
        import hashlib
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            while True:
                block = f.read(2**20)  # Magic number: one-megabyte blocks.
                if not block:
                    break
                sha.update(block)
        return sha.hexdigest()

Of course, who says we can hold a one-megabyte string in memory? We probably can, but what if we are on a tiny embedded computer?

I'm sorry I can't think of a cleaner way that is guaranteed not to run out of memory on huge files, that has no magic numbers, and that performs as well as the original, simple Pythonic solution.
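
One possible variation (my own sketch, not necessarily cleaner): the two-argument form of iter() can hide the while True / break, although the magic block size remains:

    import hashlib
    from functools import partial

    def sha1OfFile(filepath, blocksize=2**20):
        sha = hashlib.sha1()
        with open(filepath, 'rb') as f:
            # iter(callable, sentinel) keeps calling f.read(blocksize)
            # until it returns b'' at end of file.
            for block in iter(partial(f.read, blocksize), b''):
                sha.update(block)
        return sha.hexdigest()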

+31
Oct 31 '13 at 16:14


