How to use MD5 hash (or other binary data) as a key name?

Question

I am trying to use an MD5 hash as the key name in App Engine, but the code I wrote raises a UnicodeDecodeError:

 from google.appengine.ext import db
 import hashlib
 key = db.Key.from_path('Post', hashlib.md5('thecakeisalie').digest())

I do not want to use hexdigest(), since it is not only a kludge but also less efficient (base64 would work better).

+4

python google-app-engine google-cloud-datastore

Noah McIlraith Dec 22 '10 at 10:20

5 answers

The App Engine Python docs say:

The key name is saved as a Unicode string (with str values converted to ASCII).

The key name must be Unicode-encodable. You need to change the call from digest() to hexdigest(), i.e.:

 k = hashlib.md5('thecakeisalie').hexdigest()
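For readers on Python 3, where str and bytes are separate types, the constraint can be checked directly. The following is only a sketch: it assumes the error comes from the implicit ASCII conversion described in the quoted docs, and `is_ascii_safe` is a hypothetical helper, not part of the App Engine API.

```python
import hashlib


def is_ascii_safe(data: bytes) -> bool:
    """Return True if every byte decodes as ASCII (hypothetical helper)."""
    try:
        data.decode("ascii")
        return True
    except UnicodeDecodeError:
        return False


md5 = hashlib.md5(b"thecakeisalie")  # Python 3: md5() takes bytes, not str
raw = md5.digest()                   # 16 arbitrary bytes
hex_name = md5.hexdigest()           # 32 characters drawn from 0-9 and a-f

# The raw digest is only ASCII-safe if no byte is >= 0x80;
# the hex form is always ASCII-safe.
print(is_ascii_safe(raw), is_ascii_safe(hex_name.encode("ascii")))
```

The hex form can always be turned back into the raw bytes with bytes.fromhex(hex_name) if the binary value is needed later.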

+12

vz0 Dec 22 '10 at 13:21

Think about data sizes. The optimal solution here is 16 bytes:

 >>> hashlib.md5('thecakeisalie').digest()
 "'\xfc\xce\x84h\xa9\x1e\x8a\x12;\xa5\xb1K\xea\xef\xd6"
 >>> len(hashlib.md5('thecakeisalie').digest())
 16

The first thing you thought of was hexdigest, but that is nowhere near 16 bytes:

 >>> hashlib.md5('thecakeisalie').hexdigest()
 '27fcce8468a91e8a123ba5b14beaefd6'
 >>> len(hashlib.md5('thecakeisalie').hexdigest())
 32

But the raw digest does not give you ASCII-encodable bytes, so we need to do something else. A simple approach is to use the Python repr:

 >>> repr(hashlib.md5('thecakeisalie').digest())
 '"\'\\xfc\\xce\\x84h\\xa9\\x1e\\x8a\\x12;\\xa5\\xb1K\\xea\\xef\\xd6"'
 >>> len(repr(hashlib.md5('thecakeisalie').digest()))
 54

We can get rid of a lot of that by removing the "\x" escapes and the surrounding quotes:

 >>> repr(hashlib.md5('thecakeisalie').digest())[1:-1].replace('\\x','')
 "'fcce84ha91e8a12;a5b1Keaefd6"
 >>> len(repr(hashlib.md5('thecakeisalie').digest())[1:-1].replace('\\x',''))
 28

This is very good, but base64 does a little better:

 >>> base64.b64encode(hashlib.md5('thecakeisalie').digest())
 'J/zOhGipHooSO6WxS+rv1g=='
 >>> len(base64.b64encode(hashlib.md5('thecakeisalie').digest()))
 24

Overall, base64 is the most space-efficient of these, but I would just go with hexdigest, since it is probably the best optimized (in terms of time).
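The two sizes follow from the encodings themselves: hex spends two characters per byte, while base64 spends four characters per three bytes, rounded up with padding. A minimal Python 3 sketch of that comparison (`encoded_sizes` is a hypothetical helper name introduced here for illustration):

```python
import base64
import hashlib


def encoded_sizes(data: bytes) -> dict:
    """Length in characters of each ASCII-safe encoding of `data`."""
    return {
        "hex": len(data.hex()),                  # 2 characters per byte
        "base64": len(base64.b64encode(data)),   # 4 characters per 3 bytes, padded
    }


digest = hashlib.md5(b"thecakeisalie").digest()  # always 16 bytes for MD5
sizes = encoded_sizes(digest)
print(sizes)  # {'hex': 32, 'base64': 24}
```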

Gnibbler's answer gives a length of 16!

 >>> hashlib.md5('thecakeisalie').digest().decode("iso-8859-1")
 u"'\xfc\xce\x84h\xa9\x1e\x8a\x12;\xa5\xb1K\xea\xef\xd6"
 >>> len(hashlib.md5('thecakeisalie').digest().decode("iso-8859-1"))
 16

+4

bukzor Jan 7 '11 at 3:43

I find base64-encoding the binary data a reasonable solution. Based on your code, you could do something like:

 import hashlib
 import base64
 print base64.b64encode(hashlib.md5('thecakeisalie').digest())
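On Python 3 the same idea needs two small adjustments: md5() takes bytes, and b64encode() returns bytes that have to be decoded to get a str. A sketch under those assumptions, with `to_key_name` and `from_key_name` as hypothetical helper names rather than anything provided by App Engine:

```python
import base64
import hashlib


def to_key_name(data: bytes) -> str:
    """Encode binary data as an ASCII-only string (hypothetical helper)."""
    return base64.b64encode(data).decode("ascii")


def from_key_name(name: str) -> bytes:
    """Recover the original bytes from a name made by to_key_name()."""
    return base64.b64decode(name)


digest = hashlib.md5(b"thecakeisalie").digest()
name = to_key_name(digest)
print(name, len(name))
```

Because the encoding is reversible, the original digest can still be recovered from the key name when needed.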

+1

fsaint Dec 22 '10 at 10:25

An Entity Key in App Engine can have either an ID (an integer of 4 bytes) or a name (a 500-byte UTF-8 encoded string).

The MD5 digest is 16 bytes of binary data: too large for an integer and (most likely) not valid UTF-8. Some form of encoding has to be used.

If hexdigest() is too verbose at 32 bytes, try base64 at 24 bytes.

Whatever encoding scheme you use, it will ultimately be converted to UTF-8 in the datastore, so the following, which at first looks like the optimal encoding ...

 >>> u = hashlib.md5('thecakeisalie').digest().decode("iso-8859-1")
 >>> len(u)
 16

... ends up, once encoded to UTF-8, two bytes longer than the base64 encoding:

 >>> s = u.encode('utf-8')
 >>> len(s)
 26
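The extra bytes come from how UTF-8 represents code points U+0080 through U+00FF: each one takes two bytes. The digest shown in this thread contains ten bytes at or above 0x80, hence 16 + 10 = 26. A Python 3 sketch of that arithmetic, assuming (as stated above) that the name ends up stored as UTF-8; `utf8_size_via_latin1` is a hypothetical helper name:

```python
import hashlib


def utf8_size_via_latin1(data: bytes) -> int:
    """Size in bytes after decoding as iso-8859-1 and re-encoding as UTF-8."""
    return len(data.decode("iso-8859-1").encode("utf-8"))


digest = hashlib.md5(b"thecakeisalie").digest()

# Every byte >= 0x80 becomes a two-byte UTF-8 sequence.
high_bytes = sum(1 for b in digest if b >= 0x80)

# Both numbers printed here are the same: original length plus one per high byte.
print(utf8_size_via_latin1(digest), len(digest) + high_bytes)
```

For uniformly random bytes about half are at or above 0x80, so a 16-byte digest averages around 24 bytes this way, roughly the same as base64 and with no guaranteed saving.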

+1

user103576 Jan 18 '11 at 21:57

John la rooy · Accepted Answer · 2011-01-07T03:07:37+0000

Decode the bytes using iso-8859-1:

 >>> hashlib.md5('thecakeisalie').digest().decode("iso-8859-1")
 u"'\xfc\xce\x84h\xa9\x1e\x8a\x12;\xa5\xb1K\xea\xef\xd6"

This is basically a "NOP" conversion. It creates a unicode object of the same length as the original string, and it can be converted back simply with .encode("iso-8859-1") if you want.
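The reason this works is that iso-8859-1 maps each byte value 0 through 255 to the Unicode code point with the same number, so no byte is ever rejected or merged. A Python 3 sketch of the round trip over every possible byte value (`bytes_to_text` and `text_to_bytes` are hypothetical helper names):

```python
def bytes_to_text(data: bytes) -> str:
    """Map each byte to the code point with the same value."""
    return data.decode("iso-8859-1")


def text_to_bytes(text: str) -> bytes:
    """Inverse of bytes_to_text(); only valid for code points below 256."""
    return text.encode("iso-8859-1")


all_bytes = bytes(range(256))        # every possible byte value once
text = bytes_to_text(all_bytes)

# Same length, same numeric values, and the conversion is fully reversible.
print(len(text), text_to_bytes(text) == all_bytes)  # 256 True
```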
