You are right to be nervous. The whole idea that you might have a mixture of bytes and text in some data structure is horrifying. It violates the fundamental principle of working with string data: decode at input time, work exclusively in Unicode, encode at output time.
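As a minimal sketch of that principle (the function name shout_utf8 is made up for illustration, and UTF-8 is assumed as the wire encoding), the three stages might look like this. It is written so that it should run unchanged on Python 2.7 and Python 3:

```python
def shout_utf8(raw_bytes):
    """Hypothetical helper: upper-case UTF-8 encoded input."""
    text = raw_bytes.decode('utf-8')   # 1. decode at input time
    text = text.upper()                # 2. work exclusively in Unicode
    return text.encode('utf-8')        # 3. encode at output time


result = shout_utf8(b'caf\xc3\xa9')    # UTF-8 bytes for "cafe" with e-acute
```

Only the first and last lines of the function ever touch bytes; everything in between sees text.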
Update in response to comment:
You are about to output some sort of HTTP request. This needs to be prepared as a byte string. The fact that urllib.urlencode is not capable of properly preparing that byte string if your dict contains unicode characters with ordinal >= 128 is indeed unfortunate. If you have a mixture of byte strings and unicode strings in your dict, you need to be careful. Let's examine just what urlencode() does:
    >>> import urllib
    >>> tests = ['\x80', '\xe2\x82\xac', 1, '1', u'1', u'\x80', u'\u20ac']
    >>> for test in tests:
    ...     print repr(test), repr(urllib.urlencode({'a':test}))
    ...
    '\x80' 'a=%80'
    '\xe2\x82\xac' 'a=%E2%82%AC'
    1 'a=1'
    '1' 'a=1'
    u'1' 'a=1'
    u'\x80'
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "C:\python27\lib\urllib.py", line 1282, in urlencode
        v = quote_plus(str(v))
    UnicodeEncodeError: 'ascii' codec can't encode character u'\x80' in position 0: ordinal not in range(128)
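The last two tests demonstrate the problem with urlencode(): the loop aborts at u'\x80', so u'\u20ac' is never reached, but it would fail the same way because str() cannot encode it as ASCII. Now let's look at the str tests.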
If you insist on having a mixture, you should at least ensure that str objects are encoded in UTF-8.
'\x80' is suspicious: it is not the result of any_valid_unicode_string.encode('utf8').
'\xe2\x82\xac' is OK; it's the result of u'\u20ac'.encode('utf8').
'1' is OK: all ASCII characters are OK on input to urlencode(), which will percent-encode characters such as '%' if necessary.
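The three cases above can be checked mechanically by attempting a strict UTF-8 decode. The helper below is only an illustrative sketch (is_valid_utf8 is a made-up name, not part of urllib), and it should behave the same on Python 2.7 and Python 3:

```python
def is_valid_utf8(data):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode('utf-8')
    except UnicodeDecodeError:
        return False
    return True


checks = [
    is_valid_utf8(b'\x80'),          # False: lone continuation byte
    is_valid_utf8(b'\xe2\x82\xac'),  # True: UTF-8 for u'\u20ac'
    is_valid_utf8(b'1'),             # True: plain ASCII is valid UTF-8
]
```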
Here's a suggested converter function. It doesn't mutate the input dict as well as returning it (as yours does); it returns a new dict. It forces an exception if a value is a str object but is not a valid UTF-8 string. By the way, your concern about it not handling nested objects is a little misdirected: your code works only with dicts, and the concept of nested dicts doesn't really fly.
    def encoded_dict(in_dict):
        out_dict = {}
        for k, v in in_dict.iteritems():
            if isinstance(v, unicode):
                v = v.encode('utf8')
            elif isinstance(v, str):
                # Must be encoded in UTF-8
                v.decode('utf8')
            out_dict[k] = v
        return out_dict
and here's the output, using the same tests in reverse order (because this time the nasty one is at the front of the list and would abort the loop immediately):
    >>> for test in tests[::-1]:
    ...     print repr(test), repr(urllib.urlencode(encoded_dict({'a':test})))
    ...
    u'\u20ac' 'a=%E2%82%AC'
    u'\x80' 'a=%C2%80'
    u'1' 'a=1'
    '1' 'a=1'
    1 'a=1'
    '\xe2\x82\xac' 'a=%E2%82%AC'
    '\x80'
    Traceback (most recent call last):
      File "<stdin>", line 2, in <module>
      File "<stdin>", line 8, in encoded_dict
      File "C:\python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 0: invalid start byte
    >>>
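If you are on Python 3 rather than 2.7, the same approach might look roughly like the sketch below. This is an assumption-laden port, not something tested against your code: encoded_dict_py3 is a made-up name, str and bytes take the places of unicode and str, and urlencode comes from urllib.parse.

```python
from urllib.parse import urlencode


def encoded_dict_py3(in_dict):
    """Hypothetical Python 3 counterpart of encoded_dict()."""
    out_dict = {}
    for k, v in in_dict.items():
        if isinstance(v, str):
            # Text: encode explicitly as UTF-8
            v = v.encode('utf-8')
        elif isinstance(v, bytes):
            # Bytes: must already be valid UTF-8, else UnicodeDecodeError
            v.decode('utf-8')
        out_dict[k] = v
    return out_dict


query = urlencode(encoded_dict_py3({'a': '\u20ac'}))
```

As in the Python 2 version, a bytes value that is not valid UTF-8 raises UnicodeDecodeError instead of being silently percent-encoded.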
Does it help?