urllib.urlencode doesn't like unicode values: how about this workaround?

If I have an object like:

d = {'a':1, 'en': 'hello'} 

... then I can pass it to urllib.urlencode, no problem:

 percent_escaped = urlencode(d)
 print percent_escaped

But if I try to pass an object with a value of type unicode, it's game over:

 d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
 percent_escaped = urlencode(d2)
 print percent_escaped
 # This fails with a UnicodeEncodeError

So my question is about a reliable way to prepare an object to be passed to urlencode.

I came up with this function, where I simply iterate through the object and encode values of type str or unicode:

 def encode_object(object):
     for k,v in object.items():
         if type(v) in (str, unicode):
             object[k] = v.encode('utf-8')
     return object

It works:

 d2 = {'a':1, 'en': 'hello', 'pt': u'olá'}
 percent_escaped = urlencode(encode_object(d2))
 print percent_escaped

And this prints a=1&en=hello&pt=ol%C3%A1, ready to be passed to a POST call or whatever.

But my encode_object function looks very shaky to me. For one thing, it does not handle nested objects.

For another, I am nervous about that if statement. Are there any other types that I should be taking into account?

And is comparing the type() of something to a native type like this good practice?

 type(v) in (str, unicode) # not so sure about this... 

Thanks!

+47
python unicode urlencode
Jun 25 2018-11-21T00:
8 answers

You should indeed be nervous. The whole idea that you might have a mixture of bytes and text in some data structure is horrifying. It violates the fundamental principle of working with string data: decode at input time, work exclusively in unicode, encode at output time.
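A minimal sketch of that principle, assuming Python 3 (where str is text and bytes is raw data). The helper names read_param and build_query are hypothetical, made up for illustration only:

```python
from urllib.parse import urlencode

def read_param(raw):
    # Decode once, at the input boundary: bytes in, text out.
    return raw.decode('utf-8')

def build_query(params):
    # Work with text internally; encode once, at the output boundary.
    # urlencode() returns pure-ASCII text, so 'ascii' is enough here.
    return urlencode(params).encode('ascii')

greeting = read_param(b'ol\xc3\xa1')            # UTF-8 bytes for u'olá'
query = build_query({'a': 1, 'pt': greeting})
print(query)
```

Everything between the two boundaries only ever sees text, so there is no mixture of types to worry about.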

Update in response to comment:

You are about to output some sort of HTTP request. This needs to be prepared as a byte string. The fact that urllib.urlencode is not capable of properly preparing that byte string if your dict contains unicode characters with ordinal >= 128 is indeed unfortunate. If you have a mixture of byte strings and unicode strings in your dict, you need to be careful. Let's examine just what urlencode() does:

 >>> import urllib
 >>> tests = ['\x80', '\xe2\x82\xac', 1, '1', u'1', u'\x80', u'\u20ac']
 >>> for test in tests:
 ...     print repr(test), repr(urllib.urlencode({'a':test}))
 ...
 '\x80' 'a=%80'
 '\xe2\x82\xac' 'a=%E2%82%AC'
 1 'a=1'
 '1' 'a=1'
 u'1' 'a=1'
 u'\x80'
 Traceback (most recent call last):
   File "<stdin>", line 2, in <module>
   File "C:\python27\lib\urllib.py", line 1282, in urlencode
     v = quote_plus(str(v))
 UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)

The last two tests demonstrate the problem with urlencode(). Now let's look at the str tests.

If you insist on having a mixture, you should at least ensure that str objects are encoded in UTF-8.

'\x80' is suspicious: it is not the result of any_valid_unicode_string.encode('utf8').
'\xe2\x82\xac' is OK; it is the result of u'\u20ac'.encode('utf8').
'1' is OK: all ASCII characters are fine as input to urlencode(), which will percent-encode characters such as '%' if necessary.
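The three cases above can be checked mechanically. This is a small sketch assuming Python 3, where byte strings are bytes objects; is_valid_utf8 is a hypothetical helper name, not something from urllib:

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True

print(is_valid_utf8(b'\x80'))          # lone continuation byte
print(is_valid_utf8(b'\xe2\x82\xac'))  # UTF-8 encoding of the euro sign
print(is_valid_utf8(b'1'))             # plain ASCII
```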

Here's a suggested converter function. It doesn't mutate the input dict as well as returning it (as yours does); it returns a new dict. It forces an exception if a value is a str object but is not a valid UTF-8 string. By the way, your concern about it not handling nested objects is a little misdirected: your code works only with dicts, and the concept of nested dicts doesn't really fly.

 def encoded_dict(in_dict):
     out_dict = {}
     for k, v in in_dict.iteritems():
         if isinstance(v, unicode):
             v = v.encode('utf8')
         elif isinstance(v, str):
             # Must be encoded in UTF-8
             v.decode('utf8')
         out_dict[k] = v
     return out_dict

and here's the output, using the same tests in reverse order (because the nasty one is at the front this time):

 >>> for test in tests[::-1]:
 ...     print repr(test), repr(urllib.urlencode(encoded_dict({'a':test})))
 ...
 u'\u20ac' 'a=%E2%82%AC'
 u'\x80' 'a=%C2%80'
 u'1' 'a=1'
 '1' 'a=1'
 1 'a=1'
 '\xe2\x82\xac' 'a=%E2%82%AC'
 '\x80'
 Traceback (most recent call last):
   File "<stdin>", line 2, in <module>
   File "<stdin>", line 8, in encoded_dict
   File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
     return codecs.utf_8_decode(input, errors, True)
 UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
 >>>
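For anyone on Python 3, the same idea might look roughly like the sketch below. It is an adaptation under the assumption that str plays the role of unicode and bytes the role of the old str; it is not the Python 2 function above.

```python
from urllib.parse import urlencode

def encoded_dict(in_dict):
    """Return a new dict whose text values are UTF-8 encoded bytes."""
    out_dict = {}
    for k, v in in_dict.items():
        if isinstance(v, str):
            v = v.encode('utf-8')
        elif isinstance(v, bytes):
            # Must already be valid UTF-8; raises UnicodeDecodeError if not.
            v.decode('utf-8')
        out_dict[k] = v
    return out_dict

# 'ol\xe1' is u'olá' written with an escape to keep this file ASCII-only.
print(urlencode(encoded_dict({'a': 1, 'en': 'hello', 'pt': 'ol\xe1'})))
```

Python 3's urlencode percent-encodes bytes values as they are, so the validation step is still what catches a stray b'\x80'.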

Does it help?

+63
Jun 25 '11 at 23:22

This is a broader topic than it looks, especially when you have to deal with more complex dictionary values. I found 3 ways to solve the problem:

  • Patch urllib.py to include an encoding parameter:

     def urlencode(query, doseq=0, encoding='ascii'): 

    and replace all str(v) conversions with something like v.encode(encoding)

    Obviously, this is not good, since it is hardly redistributable and even harder to maintain.

  • Change the default Python encoding as described here. The blog's author describes some of the problems with this solution quite clearly, and who knows how many more of them could be lurking in the shadows. So that doesn't look good to me either.

  • So I personally ended up with this abomination, which encodes all unicode strings to UTF-8 byte strings in any (reasonably) complex structure:

     def encode_obj(in_obj):

         def encode_list(in_list):
             out_list = []
             for el in in_list:
                 out_list.append(encode_obj(el))
             return out_list

         def encode_dict(in_dict):
             out_dict = {}
             for k, v in in_dict.iteritems():
                 out_dict[k] = encode_obj(v)
             return out_dict

         if isinstance(in_obj, unicode):
             return in_obj.encode('utf-8')
         elif isinstance(in_obj, list):
             return encode_list(in_obj)
         elif isinstance(in_obj, tuple):
             return tuple(encode_list(in_obj))
         elif isinstance(in_obj, dict):
             return encode_dict(in_obj)

         return in_obj

    You can use it as follows: urllib.urlencode(encode_obj(complex_dictionary))

    To also encode the keys, out_dict[k] can be replaced with out_dict[k.encode('utf-8')] , but that was too much for me.
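A compact variant that encodes the keys as well could look like the sketch below. It assumes Python 3 (str is the text type, bytes is the encoded type) and is a rewrite for illustration rather than the function above:

```python
from urllib.parse import urlencode

def encode_obj(obj):
    """Recursively encode every str in obj (dict keys included) to UTF-8 bytes."""
    if isinstance(obj, str):
        return obj.encode('utf-8')
    if isinstance(obj, dict):
        return {encode_obj(k): encode_obj(v) for k, v in obj.items()}
    if isinstance(obj, list):
        return [encode_obj(el) for el in obj]
    if isinstance(obj, tuple):
        return tuple(encode_obj(el) for el in obj)
    return obj  # ints, None, bytes and so on pass through untouched

# doseq=True makes urlencode emit one key=value pair per list element.
print(urlencode(encode_obj({'k': ['ol\xe1', 'x']}), doseq=True))
```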

+7
Oct 26 '14 at 0:21

I had the same problem with the German Umlaut. The solution is pretty simple:

In Python 3+, urlencode allows you to specify an encoding:

 >>> from urllib.parse import urlencode
 >>> args = {'a':1, 'en': 'hello', 'pt': u'olá'}
 >>> urlencode(args, encoding='utf-8')
 'a=1&en=hello&pt=ol%C3%A1'
+6
Apr 21 '16 at 9:16

It seems that you cannot pass a unicode object to urlencode, so before calling it you must encode every unicode parameter. How you do this properly seems to me to be very context dependent, but in your code you should always know when you are using a unicode Python object (the unicode representation) and when you are using an encoded object (a bytestring).

Also, encoding str values is "superfluous": see "What is the difference between encode/decode?"

+5
Jun 25 2018-11-22T00:

Nothing new to add except to point out that the urlencode algorithm is nothing tricky. Rather than processing your data once and then calling urlencode on it, it would be perfectly fine to do something like:

 from urllib import quote_plus

 def urlencode_utf8(params):
     if hasattr(params, 'items'):
         params = params.items()
     return '&'.join(
         (quote_plus(k.encode('utf8'), safe='/') + '=' + quote_plus(v.encode('utf8'), safe='/')
             for k, v in params))

Looking at the source code of the urllib module (Python 2.6), their implementation does not do much more. There is an optional feature where values in the parameters that are themselves sequences (for example 2-tuples) are turned into separate key-value pairs, which is sometimes useful, but if you know you won't need that, the above will do.

You can even get rid of the if hasattr(params, 'items'): check if you know you won't need to handle lists of 2-tuples as well as dicts.
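A rough Python 3 counterpart, as a sketch only: it assumes urllib.parse.quote_plus, which encodes str input as UTF-8 by default, and it converts keys and values with str() so that integers also work (an addition that is not in the function above).

```python
from urllib.parse import quote_plus

def urlencode_utf8(params):
    # Accept either a dict or an iterable of (key, value) pairs.
    if hasattr(params, 'items'):
        params = params.items()
    return '&'.join(
        quote_plus(str(k), safe='/') + '=' + quote_plus(str(v), safe='/')
        for k, v in params)

# 'ol\xe1' is u'olá'; the space becomes '+', and '/' is left alone.
print(urlencode_utf8({'a': 1, 'pt': 'ol\xe1', 'path': 'x/y z'}))
```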

+2
Nov 16

I solved this with this add_get_to_url() method:

 import urllib

 def add_get_to_url(url, get):
     return '%s?%s' % (url, urllib.urlencode(list(encode_dict_to_bytes(get))))

 def encode_dict_to_bytes(query):
     if hasattr(query, 'items'):
         query = query.items()
     for key, value in query:
         yield (encode_value_to_bytes(key), encode_value_to_bytes(value))

 def encode_value_to_bytes(value):
     if not isinstance(value, unicode):
         return str(value)
     return value.encode('utf8')

Features:

  • "get" can be a dict or a list of (key, value) pairs
  • Order is not lost
  • Values can be integers or other simple data types.

Feedback is welcome.
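On Python 3 most of this collapses, because urllib.parse.urlencode already encodes text as UTF-8 and keeps the order of a list of pairs. A sketch under that assumption (the example.com URL is a placeholder, not from the answer):

```python
from urllib.parse import urlencode

def add_get_to_url(url, get):
    # Accept a dict or a list of (key, value) pairs; a list keeps its order.
    pairs = get.items() if hasattr(get, 'items') else get
    return '%s?%s' % (url, urlencode(list(pairs)))

# 'Gr\xfc\xdfe' is u'Grüße'; the integer value is converted with str().
print(add_get_to_url('http://example.com/search',
                     [('q', 'Gr\xfc\xdfe'), ('page', 2)]))
```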

+1
Feb 25 '16 at 11:22

this one line works fine in my case ->

 urllib.quote(unicode_string.encode('utf-8')) 

thanks @IanCleland and @PavelVlasov

-1
Jul 11 '17 at 3:07

Why such long answers?

urlencode(unicode_string.encode('utf-8'))

-4
May 27 '12 at 8:33


