What is the correct way to compress and decompress UTF-8 data using zlib?

I have a very long JSON message containing characters that go beyond the limits of an ASCII table. I convert it to a string as follows:

 messStr = json.dumps(message, encoding='utf-8', ensure_ascii=False, sort_keys=True) 

I need to store this string using a service that limits its size to X bytes, so I want to break the JSON string into pieces of length X and store them separately. I ran into some problems doing this (described here), so I want to compress the pieces to work around them. I tried to do this:

 ss = mStr[start:fin]     # get piece of length X
 ssc = zlib.compress(ss)  # compress it 

When I do this, I get the following error from zlib.compress :

 UnicodeEncodeError: 'ascii' codec can't encode character u'\xf1' in position 225: ordinal not in range(128) 

What is the correct way to compress a UTF-8 string and how to decompress it correctly?


Your JSON data is not UTF-8 encoded. The encoding parameter of the json.dumps() function specifies how to interpret Python byte strings in message (i.e. the input), not how to encode the result. The result is not encoded at all, because you used ensure_ascii=False, so you get a unicode string back.

Encode data before compression:

 ssc = zlib.compress(ss.encode('utf8')) 

When decompressing again, there is no need to decode from UTF-8; json.loads() accepts UTF-8 input if given a byte string.
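As a minimal round-trip sketch (written in Python 3 syntax, where json.dumps() returns a text string and has no encoding parameter; the sample message is illustrative):

```python
import json
import zlib

message = {"name": u"Espa\u00f1a"}  # contains a non-ASCII character

# Serialize without escaping non-ASCII characters.
ss = json.dumps(message, ensure_ascii=False, sort_keys=True)

# Encode the text to bytes before compressing; zlib works on bytes, not text.
ssc = zlib.compress(ss.encode('utf-8'))

# Round-trip: decompress, then decode back to text and parse.
restored = json.loads(zlib.decompress(ssc).decode('utf-8'))
assert restored == message
```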


A small addition to Martijn's answer. I read on the Enthought blog about a nice codec-based one-liner that saves you the trouble of importing zlib in your own code.

Safely compressing the string (including your JSON dump) would look like this:

 ssc = ss.encode('utf-8').encode('zlib_codec') 

Decompressing back to UTF-8 would be:

 ss = ssc.decode('zlib_codec').decode('utf-8') 
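Note that chaining encode() calls like this only works on Python 2, where str.encode() may return bytes-to-bytes transforms. A sketch of the same trick on Python 3, going through the codecs module instead (the sample string is illustrative):

```python
import codecs
import json

ss = json.dumps({"city": u"Z\u00fcrich"}, ensure_ascii=False)

# Python 3: bytes has no .encode(), so use codecs.encode/decode
# to apply the bytes-to-bytes zlib_codec transform.
ssc = codecs.encode(ss.encode('utf-8'), 'zlib_codec')

# Reverse the two steps: inflate, then decode the UTF-8 bytes back to text.
ss2 = codecs.decode(ssc, 'zlib_codec').decode('utf-8')
assert ss2 == ss
```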

Hope this helps.


Source: https://habr.com/ru/post/952522/