Python raises a UnicodeEncodeError even though I call str.decode(). Why?

Consider this function:

    import htmlentitydefs

    def escape(text):
        print repr(text)
        escaped_chars = []
        for c in text:
            try:
                c = c.decode('ascii')
            except UnicodeDecodeError:
                c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
            escaped_chars.append(c)
        return ''.join(escaped_chars)

It should escape every non-ASCII character with the corresponding entity from htmlentitydefs. Unfortunately, Python raises

 UnicodeEncodeError: 'ascii' codec can't encode character u'\xe1' in position 0: ordinal not in range(128) 

when the text variable contains a string whose repr() is u'Tam\xe1s Horv\xe1th'.

But I do not call str.encode(). I only call str.decode(). Did I miss something?

+4
6 answers

Python has two types of strings: character strings (the unicode type) and byte strings (the str type). The code you posted works on byte strings. You need a similar function that handles character strings.

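Perhaps this:

    import htmlentitydefs

    def uescape(text):
        print repr(text)
        escaped_chars = []
        for c in text:
            # Escape everything outside the printable ASCII range.
            # Raises KeyError for code points that have no named entity.
            if (ord(c) < 32) or (ord(c) > 126):
                c = '&{};'.format(htmlentitydefs.codepoint2name[ord(c)])
            escaped_chars.append(c)
        return ''.join(escaped_chars)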

I do wonder whether you need such a function at all. If it were me, I would choose UTF-8 as the character encoding for the resulting document, process the document as a character string (without worrying about entities), and call content.encode('UTF-8') as the last step before delivering it to the client. Depending on the web framework used, you may even be able to pass character strings directly to the API and let it work out how to set the encoding.
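A minimal sketch of that approach. The render_page helper is hypothetical (it is not part of the question) and only stands in for whatever builds the document as a character string. The u'' and b'' prefixes keep the block valid on both Python 2.7 and Python 3.

```python
def render_page(name):
    """Hypothetical helper: build the document as a character string."""
    # Non-ASCII characters are left alone; the page declares UTF-8.
    return u'<meta charset="utf-8"><p>Hello, {0}!</p>'.format(name)


content = render_page(u'Tam\xe1s Horv\xe1th')

# Encode exactly once, as the last step before sending bytes to the client.
payload = content.encode('UTF-8')
```

Here payload is a byte string in which each á takes two bytes (\xc3\xa1), and content stays a character string for all earlier processing.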

+2

This is a misleading error message that comes from the way Python handles the de/encoding process. You tried to decode an already decoded string a second time, and this confuses the Python function, which retaliates by confusing you in turn! ;-) The encoding/decoding is done, as far as I know, by the codecs module. And somewhere in there lies the origin of this misleading exception message.

You can check yourself: either

 u'\x80'.encode('ascii') 

or

 u'\x80'.decode('ascii') 

raises a UnicodeEncodeError, whereas

 u'\x80'.encode('utf8') 

does not, but

 u'\x80'.decode('utf8') 

does again!
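Roughly speaking, when Python 2 is asked to decode a unicode object, it first converts that object to a byte string with the default codec (ASCII), and it is this hidden encode that fails. The sketch below writes the hidden step out explicitly, so it also runs on Python 3, where str has no decode method at all. The helper names py2_style_decode and error_name are made up for illustration.

```python
def py2_style_decode(text, codec):
    """Mimic what Python 2 does for unicode_obj.decode(codec)."""
    # Hidden step: unicode -> bytes using the default codec (ASCII).
    raw = text.encode('ascii')
    # Only now is the requested codec used, to decode the bytes.
    return raw.decode(codec)


def error_name(func, *args):
    """Return the name of the Unicode error raised by func(*args), or None."""
    try:
        func(*args)
    except UnicodeError as exc:
        return type(exc).__name__
    return None


results = {
    "decode('ascii')": error_name(py2_style_decode, u'\x80', 'ascii'),
    "decode('utf8')": error_name(py2_style_decode, u'\x80', 'utf8'),
    "encode('ascii')": error_name(u'\x80'.encode, 'ascii'),
    "encode('utf8')": error_name(u'\x80'.encode, 'utf8'),
}
```

Both decode attempts fail in the hidden ASCII encode, before the requested codec is ever consulted, which is why the exception is an encode error in either case.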

I assume that you are confusing the meaning of encoding and decoding. Simply put:

                        decode             encode
    ByteString (ascii) --------> UNICODE ---------> ByteString (utf8)
                        codec              codec

But why is there a codec argument for the decode method? Well, the underlying function cannot guess which codec the byte string was encoded with, so it takes the codec as an argument. If you do not provide one, it assumes you mean the codec returned by sys.getdefaultencoding().

So when you use c.decode('ascii'), you a) have an (encoded) byte string (which is why you use decode), b) want to get a unicode object (which is what decode is for), and c) state that the codec in which the byte string is encoded is ascii.
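The two directions can be tried out directly. A short sketch, assuming the incoming bytes happen to be UTF-8 encoded (the sample bytes are made up for illustration):

```python
# Bytes as they might arrive from a file or socket: UTF-8 encoded.
raw = b'Tam\xc3\xa1s'

# decode: byte string -> unicode, naming the codec the bytes are in.
text = raw.decode('utf-8')

# encode: unicode -> byte string, naming the codec you want out.
as_latin1 = text.encode('latin-1')
as_utf8 = text.encode('utf-8')

# Naming the wrong codec on decode fails, this time with a *Decode* error.
try:
    raw.decode('ascii')
    wrong_codec_error = None
except UnicodeDecodeError as exc:
    wrong_codec_error = type(exc).__name__
```

The same five characters occupy six bytes in UTF-8 and five bytes in Latin-1, while the unicode object in the middle is independent of either byte layout.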

See also: fooobar.com/questions/11845 / ...
http://docs.python.org/howto/unicode.html
http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
http://www.stereoplex.com/blog/python-unicode-and-unicodedecodeerror

+11

You are passing a string that is already unicode. So before Python can call decode on it, it must first encode it, and by default it does so using the ASCII encoding.

Edit to add: It depends on what you want to do. If you just want to convert a unicode string with non-ASCII characters to an HTML-encoded representation, you can do that in a single call: text.encode('ascii', 'xmlcharrefreplace').
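For instance, with the string from the question (a sketch; note that this error handler produces numeric character references rather than the named entities from htmlentitydefs, and that it leaves ASCII markup characters such as & and < untouched):

```python
text = u'Tam\xe1s Horv\xe1th'

# Characters that ASCII cannot represent become numeric references (&#NNN;).
escaped = text.encode('ascii', 'xmlcharrefreplace')
```

The result is the byte string Tam&#225;s Horv&#225;th, where 225 is the code point of á (0xe1).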

+5

This answer always works for me when I have this problem:

    def byteify(input):
        '''
        Removes unicode encodings from the given input string.
        '''
        if isinstance(input, dict):
            return {byteify(key): byteify(value)
                    for key, value in input.iteritems()}
        elif isinstance(input, list):
            return [byteify(element) for element in input]
        elif isinstance(input, unicode):
            return input.encode('utf-8')
        else:
            return input

From "How to get string objects instead of Unicode from JSON in Python?"

+1

I found a solution on this site:

    import sys
    reload(sys)
    sys.setdefaultencoding("latin-1")

    a = u'\xe1'
    print str(a)  # no exception

0

Decoding a str does not make sense here.

I think you can check ord(c) > 127 instead.
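A sketch of that check, assuming text is a unicode string. The function name escape_non_ascii and the numeric-reference fallback for characters without a named entity are additions for illustration, not part of the question. The import is written so that the block runs on both Python 2 and Python 3.

```python
try:
    from htmlentitydefs import codepoint2name  # Python 2
except ImportError:
    from html.entities import codepoint2name  # Python 3


def escape_non_ascii(text):
    """Replace every non-ASCII character in a unicode string with an entity."""
    escaped_chars = []
    for c in text:
        code = ord(c)
        if code > 127:
            name = codepoint2name.get(code)
            # Fall back to a numeric reference when no named entity exists.
            c = u'&{0};'.format(name) if name else u'&#{0};'.format(code)
        escaped_chars.append(c)
    return u''.join(escaped_chars)
```

With the string from the question, escape_non_ascii(u'Tam\xe1s Horv\xe1th') gives u'Tam&aacute;s Horv&aacute;th'. No decode call is involved, so the implicit ASCII encode never happens.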

-2

Source: https://habr.com/ru/post/910644/

