What happens when encode is used on str in Python?

I have been reading about Unicode, encoding and decoding, but I do not understand why the encode function works on type str. I expected it to work only on unicode. So my question is: how does encode behave when it is called on a str rather than a unicode object?

3 answers

In Python 2, two kinds of codecs are available: those that convert between str and unicode, and those that convert from str to str. Examples of the latter are base64 and rot13.

There is a str.encode() method to support the latter:

 'binary data'.encode('base64') 

But now that it exists, people also use it for the unicode -> str codecs; encoding can only go from unicode to str (and decoding the other way). To support this, Python will implicitly decode your str value to unicode first, using the ASCII codec, before finally encoding.

Incidentally, when you use a str -> str codec on a unicode object, Python first implicitly encodes it to str using the same ASCII codec.
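The two implicit steps described above can be spelled out explicitly. The following is only a sketch written for Python 3, where bytes stands in for the old str and str for the old unicode; the helper names py2_style_str_encode and py2_style_unicode_base64 are made up for illustration and are not part of any library.

```python
import codecs


def py2_style_str_encode(raw: bytes, encoding: str) -> bytes:
    """Hypothetical helper mimicking Python 2's 'some str'.encode(encoding)."""
    text = raw.decode("ascii")    # implicit step: str -> unicode via ASCII
    return text.encode(encoding)  # requested step: unicode -> str


def py2_style_unicode_base64(text: str) -> bytes:
    """Hypothetical helper mimicking Python 2's u'some text'.encode('base64')."""
    raw = text.encode("ascii")           # implicit step: unicode -> str via ASCII
    return codecs.encode(raw, "base64")  # requested step: str -> str


# Pure ASCII input survives the hidden decode, so the call appears to "just work".
ascii_result = py2_style_str_encode(b"hi", "utf-8")

# A byte above 0x7f makes the hidden ASCII decode fail before any encoding happens.
try:
    py2_style_str_encode(b"hi\xff", "utf-8")
    implicit_step_failed = False
except UnicodeDecodeError:
    implicit_step_failed = True

base64_result = py2_style_unicode_base64("hi")
```

The point of writing it this way is that the failure comes from the first line of the helper, not from the codec you asked for.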

In Python 3, this was solved a) by removing the bytes.encode() and str.decode() methods (remember that bytes is sort of the old str, and str the new unicode), and b) by moving the same-type encodings to the codecs module only, where they are used through codecs.encode() and codecs.decode(). Which codecs convert between the same type has also been cleaned up and updated; see the "Python Specific Encodings" section of the codecs documentation. Note that the "text" encodings listed there, where available in Python 2, encode to str instead.
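As a small illustration of the Python 3 side, here is how the same-type codecs can be reached through the codecs module. This is a sketch assuming a reasonably recent Python 3 (the "base64" alias for this codec is not available in the earliest 3.x releases).

```python
import codecs

# bytes -> bytes codec: only reachable through the codecs module in Python 3.
encoded = codecs.encode(b"binary data", "base64")
decoded = codecs.decode(encoded, "base64")

# str -> str codec, same idea.
rotated = codecs.encode("hello", "rot_13")
restored = codecs.decode(rotated, "rot_13")

# The methods that caused the confusion in Python 2 are gone.
bytes_has_encode = hasattr(b"", "encode")
str_has_decode = hasattr("", "decode")

print(encoded, decoded, rotated, restored, bytes_has_encode, str_has_decode)
```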


Python knows that it cannot encode a str directly, so it tries to decode it first! It uses the 'ascii' codec, which fails if the string contains any byte with a value above 0x7f.

This is why you sometimes see a decode error when you are trying to encode.


In Python 3, calling encode on a byte string simply doesn't work.

 >>> b'hi'.encode('utf-8')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 AttributeError: 'bytes' object has no attribute 'encode'

Python 2 tries to be helpful when you call encode on a str: it first tries to decode the string using sys.getdefaultencoding() (usually ascii) and then encodes the result.

That's why you get the rather strange error message saying that decoding with ascii is not possible when you are trying to encode to utf-8.

 >>> 'hi\xFF'.encode('utf-8')
 Traceback (most recent call last):
   File "<stdin>", line 1, in <module>
 UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 2: ordinal not in range(128)
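To see where that message comes from, you can run the hidden ASCII decode by hand and look at the exception. This is a rough sketch for Python 3; the last step assumes, purely for illustration, that the bytes were really Latin-1, which is a guess about the data and not something Python can know.

```python
raw = b"hi\xff"

# The step Python 2 performs silently before encoding a str.
try:
    raw.decode("ascii")
    error = None
except UnicodeDecodeError as exc:
    error = exc

# The exception names the codec that failed, which is ascii, not utf-8.
failed_codec = error.encoding
failed_position = error.start
message = str(error)

# Decoding explicitly with the codec the data is actually in avoids the problem.
# Latin-1 is an assumption made only for this example.
fixed = raw.decode("latin-1").encode("utf-8")

print(failed_codec, failed_position, message, fixed)
```

Doing the decode yourself, with a codec you have chosen, means the implicit ASCII step never gets a chance to run.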

Ned explains it better than me, see this one from 16:20 onwards.


Source: https://habr.com/ru/post/1243981/

