Python throws a UnicodeEncodeError although I call str.decode(). Why?

Question

Consider this function:

    def escape(text):
        print repr(text)
        escaped_chars = []
        for c in text:
            try:
                c = c.decode('ascii')
            except UnicodeDecodeError:
                c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
            escaped_chars.append(c)
        return ''.join(escaped_chars)

It should escape every non-ASCII character with the corresponding entity from htmlentitydefs. Unfortunately, Python throws

 UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128)

when the text variable contains a string whose repr() is u'Tam\xe1s Horv\xe1th'.

But I do not use str.encode(). I only use str.decode(). Did I miss something?

+4

python string encoding escaping

Aufwind Dec 21 '11 at 13:55

6 answers

This is a misleading error message that comes from the way Python handles the de/encoding process. You tried to decode an already decoded string a second time, and that confuses the Python function, which retaliates by confusing you in turn! ;-) As far as I know, the encoding/decoding is done by the codecs module, and somewhere in there lies the origin of this misleading exception message.

You can check for yourself. Both

    u'\x80'.encode('ascii')

and

    u'\x80'.decode('ascii')

throw a UnicodeEncodeError, whereas

    u'\x80'.encode('utf8')

does not, but

    u'\x80'.decode('utf8')

does again!

I assume that you are confusing the meaning of encoding and decoding. Simply put:

                 decode               encode
    ByteString (ascii) --------> UNICODE ---------> ByteString (utf8)
                 codec                codec

But why does the decode method take a codec argument? The underlying function cannot guess which codec the byte string was encoded with, so it takes the codec as a hint. If you do not provide one, it assumes you mean the encoding returned by sys.getdefaultencoding().

So when you use c.decode('ascii'), you are saying that a) you have an (encoded) byte string, which is why you call decode; b) you want a unicode object, which is what decode produces; and c) the codec the byte string is encoded in is ASCII.
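The diagram above can be sketched as a short round trip. This is only an illustration: the sample value and the variable names are made up, and the byte literals are written so that the snippet runs on both Python 2 and Python 3.

```python
# Bytes as they might arrive from a file or socket: "Tamas" with an
# a-acute, encoded as UTF-8 (sample value chosen for illustration).
raw = b'Tam\xc3\xa1s'

# decode: byte string -> character string.
# The codec argument names the encoding the bytes are currently in.
text = raw.decode('utf-8')

# encode: character string -> byte string.
# The codec argument names the encoding you want to produce.
latin1_bytes = text.encode('latin-1')
utf8_bytes = text.encode('utf-8')

print(repr(text))
print(repr(latin1_bytes))
```

The same character string yields different byte strings depending on the codec, which is why decode needs to be told which one was used.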

See also: fooobar.com/questions/11845 / ...
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

+11

Don question Dec 21 '11 at 15:51

You are passing a string that is already unicode. So, before Python can call decode on it, it first has to encode it, and by default it does that using the ASCII encoding.
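A rough model of that implicit step, written as an explicit function. The helper name py2_style_decode and its default_encoding parameter are hypothetical; this is a sketch of the behaviour described above, not the interpreter's actual code path.

```python
def py2_style_decode(text, codec, default_encoding='ascii'):
    """Mimic what Python 2 roughly does for unicode.decode(codec)."""
    # Step 1 (implicit in Python 2): turn the character string into bytes
    # using the default encoding. This is where non-ASCII input fails.
    as_bytes = text.encode(default_encoding)
    # Step 2 (what the caller actually asked for): decode those bytes.
    return as_bytes.decode(codec)


try:
    py2_style_decode(u'Tam\xe1s Horv\xe1th', 'ascii')
    error_name = None
except UnicodeError as exc:
    error_name = type(exc).__name__

print(error_name)  # prints: UnicodeEncodeError
```

The exception comes from step 1, which explains why an encode error shows up even though only decode was called.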

Edit to add: it depends on what you want to do. If you just want to convert a unicode string with non-ASCII characters to an encoded HTML representation, you can do this in one call: text.encode('ascii', 'xmlcharrefreplace').
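A minimal example of that call, using the string from the question. Note that the xmlcharrefreplace error handler emits numeric character references rather than named entities.

```python
text = u'Tam\xe1s Horv\xe1th'

# Characters that ASCII cannot represent are replaced by &#NNN; references.
escaped = text.encode('ascii', 'xmlcharrefreplace')

print(escaped)
```

The result is a byte string containing Tam&#225;s Horv&#225;th, since 0xe1 is 225 in decimal.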

+5

Daniel Roseman Dec 21 '11 at 14:07

This answer always works for me when I have this problem:

    def byteify(input):
        '''
        Removes unicode encodings from the given input string.
        '''
        if isinstance(input, dict):
            return {byteify(key): byteify(value) for key, value in input.iteritems()}
        elif isinstance(input, list):
            return [byteify(element) for element in input]
        elif isinstance(input, unicode):
            return input.encode('utf-8')
        else:
            return input

Taken from "How to get string objects instead of Unicode from JSON in Python?"
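For reference, a variant of the same idea that does not depend on the Python 2 only names unicode and iteritems. The text_type alias and the sample data are assumptions added for illustration; the logic is otherwise the same recursive walk.

```python
# unicode on Python 2, str on Python 3.
text_type = type(u'')


def byteify(value):
    """Recursively encode character strings in dicts and lists as UTF-8."""
    if isinstance(value, dict):
        return dict((byteify(k), byteify(v)) for k, v in value.items())
    if isinstance(value, list):
        return [byteify(element) for element in value]
    if isinstance(value, text_type):
        return value.encode('utf-8')
    # Numbers, None, and already encoded byte strings pass through unchanged.
    return value


sample = {u'name': u'Tam\xe1s', u'tags': [u'a', 1]}
result = byteify(sample)
print(result)
```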

+1

Blairg23 Nov 26 '15 at 21:58

I found a solution on this site:

    import sys
    reload(sys)
    sys.setdefaultencoding("latin-1")

    a = u'\xe1'
    print str(a)  # no exception
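Changing the default encoding affects the whole interpreter. As a hedged alternative, the same bytes can be produced by naming the codec explicitly at the point of conversion; the variable names below are illustrative.

```python
a = u'\xe1'

# Explicit conversion: no dependence on the interpreter-wide default.
latin1 = a.encode('latin-1')
utf8 = a.encode('utf-8')

print(repr(latin1))
print(repr(utf8))
```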

0

Heladio cisneros reyes Jul 22 '16 at 16:13

Decoding a str does not make sense.

I think you can check ord(c) > 127 instead.

-2

kev Dec 21 '11 at 14:17

wberry · Accepted Answer · 2011-12-21T14:39:28+0000

Python has two types of strings: character strings (the unicode type) and byte strings (the str type). The code you pasted operates on byte strings. You need a similar function to handle character strings.

Maybe this:

    import htmlentitydefs

    def uescape(text):
        print repr(text)
        escaped_chars = []
        for c in text:
            # Escape everything outside printable ASCII.
            if (ord(c) < 32) or (ord(c) > 126):
                c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
            escaped_chars.append(c)
        return ''.join(escaped_chars)
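One caveat: codepoint2name only contains code points that have a named HTML entity, so the lookup raises KeyError for anything else, for example a newline. A possible variant that falls back to a numeric reference is sketched below; the name uescape_safe and the fallback are additions for illustration, and the import is written to work on both Python 2 and Python 3.

```python
try:
    from html.entities import codepoint2name   # Python 3
except ImportError:
    from htmlentitydefs import codepoint2name  # Python 2


def uescape_safe(text):
    """Escape characters outside printable ASCII in a character string."""
    escaped_chars = []
    for c in text:
        code = ord(c)
        if code < 32 or code > 126:
            name = codepoint2name.get(code)
            # Named entity when one exists, numeric reference otherwise.
            c = u'&{0};'.format(name) if name else u'&#{0};'.format(code)
        escaped_chars.append(c)
    return u''.join(escaped_chars)


print(uescape_safe(u'Tam\xe1s Horv\xe1th'))
```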

I do wonder whether you really need either function. If it were me, I would choose UTF-8 as the character encoding for the resulting document, process the document as a character string (without worrying about entities), and call content.encode('UTF-8') as the final step before delivering it to the client. Depending on the web framework you use, you may even be able to pass character strings directly to its API and let it figure out how to set the encoding.
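A minimal sketch of that workflow, assuming no particular framework. The render_page helper and the headers list are hypothetical stand-ins for whatever builds and delivers the response.

```python
def render_page(name):
    """Build the document as a character string (hypothetical helper)."""
    return u'<p>Hello, {0}!</p>'.format(name)


# Work with character strings everywhere inside the application.
content = render_page(u'Tam\xe1s Horv\xe1th')

# Encode exactly once, at the boundary, and declare the same charset.
body = content.encode('UTF-8')
headers = [('Content-Type', 'text/html; charset=UTF-8')]

print(repr(body))
```

With the charset declared, the non-ASCII characters need no entity escaping at all.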
