Why does print produce different output in Python 2 and Python 3 for the same string?

In Python 2:

$ python2 -c 'print "\x08\x04\x87\x18"' | hexdump -C
00000000  08 04 87 18 0a                                    |.....|
00000005

In Python 3:

$ python3 -c 'print("\x08\x04\x87\x18")' | hexdump -C
00000000  08 04 c2 87 18 0a                                 |......|
00000006

Why is the byte "\xc2" here?

Edit

I think that when a string contains a non-ASCII character, Python 3 adds a "\xc2" byte to the output (as @Ashraful Islam said).

So how can I avoid this in python3?

+6
2 answers

Consider the following code snippet:

import sys
for i in range(128, 256):
    sys.stdout.write(chr(i))

Run this with Python 2 and look at the result with hexdump -C:

00000000  80 81 82 83 84 85 86 87  88 89 8a 8b 8c 8d 8e 8f  |................|

And so on. No surprises: 128 bytes, from 0x80 to 0xff.

Do the same with Python 3:

00000000  c2 80 c2 81 c2 82 c2 83  c2 84 c2 85 c2 86 c2 87  |................|
...
00000070  c2 b8 c2 b9 c2 ba c2 bb  c2 bc c2 bd c2 be c2 bf  |................|
00000080  c3 80 c3 81 c3 82 c3 83  c3 84 c3 85 c3 86 c3 87  |................|
...
000000f0  c3 b8 c3 b9 c3 ba c3 bb  c3 bc c3 bd c3 be c3 bf  |................|

Summarizing:

  • Everything from 0x80 to 0xbf has 0xc2 prepended.
  • Everything from 0xc0 to 0xff has bit 6 set to zero and has 0xc3 prepended.

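So, what is going on here?

In Python 2, strings are byte strings (effectively ASCII), and no conversion is done. Tell it to write something outside the 0-127 ASCII range and it says "okey-doke!" and just writes those bytes. Simple.

In Python 3, strings are Unicode. When non-ASCII characters are written, they have to be encoded somehow. The default encoding here is UTF-8.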
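The encoding step can be seen in isolation, without print or a terminal involved. The following is a minimal sketch using the string from the question; the variable names are illustrative only.

```python
# The same four code points as in the question, as a Python 3 str.
text = "\x08\x04\x87\x18"

# Encoding to UTF-8 is the step that happens when a str is written
# to a stdout whose encoding is UTF-8.
encoded = text.encode("utf-8")

print(len(text))      # number of code points in the str
print(len(encoded))   # number of bytes after encoding
print(encoded.hex())  # hex digits of the encoded bytes
```

The str has four code points but the encoded form has five bytes, because only U+0087 falls outside the ASCII range.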

So, how are these values encoded in UTF-8?

Code points from 0x80 to 0x7ff are encoded as follows:

110vvvvv 10vvvvvv

Here the 11 v characters are the bits of the code point.

Thus:

0x80                 hex
1000 0000            8-bit binary
000 1000 0000        11-bit binary
00010 000000         divide into vvvvv vvvvvv
11000010 10000000    resulting UTF-8 octets in binary
0xc2 0x80            resulting UTF-8 octets in hex

0xc0                 hex
1100 0000            8-bit binary
000 1100 0000        11-bit binary
00011 000000         divide into vvvvv vvvvvv
11000011 10000000    resulting UTF-8 octets in binary
0xc3 0x80            resulting UTF-8 octets in hex

So that is why you get c2 before 87.
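The bit manipulation in the worked examples above can be sketched in a few lines. The helper name encode_two_byte is hypothetical and covers only the two-byte range; it is checked against Python's built-in encoder rather than meant to replace it.

```python
def encode_two_byte(code_point):
    """Encode a code point in 0x80..0x7ff as two UTF-8 octets (illustrative helper)."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("only the two-byte range 0x80..0x7ff is handled here")
    lead = 0b11000000 | (code_point >> 6)     # 110vvvvv: top 5 of the 11 bits
    trail = 0b10000000 | (code_point & 0x3F)  # 10vvvvvv: low 6 bits
    return bytes([lead, trail])


for cp in (0x80, 0xC0, 0x87):
    print(hex(cp), encode_two_byte(cp).hex())

# Cross-check against the built-in encoder over the whole two-byte range.
assert all(
    encode_two_byte(cp) == chr(cp).encode("utf-8") for cp in range(0x80, 0x800)
)
```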

How do you avoid all this in Python 3? Use the bytes type.
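As a rough sketch of what working with bytes can look like: the first two forms build the raw octets directly. The third form, encoding with latin-1, is an assumption about a case the answer does not cover, namely data that already exists as a str whose code points are all below 256.

```python
# A bytes literal holds raw octets; no text encoding is involved.
raw = b"\x08\x04\x87\x18"

# The same octets built from integers.
from_ints = bytes([0x08, 0x04, 0x87, 0x18])

# latin-1 maps each code point below 256 to the single byte with the
# same value, so no extra 0xc2 or 0xc3 byte appears.
from_text = "\x08\x04\x87\x18".encode("latin-1")

print(raw == from_ints == from_text)
print(raw.hex())
```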

+9

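In Python 2, the default string type is the byte string. Byte strings are written "abc", while Unicode strings are written u"abc".

In Python 3, the default string type is Unicode. Byte strings are written b"abc", while Unicode strings are written "abc" (u"abc" still works too). Since there are far more Unicode characters than there are byte values, printing them as bytes requires an encoding (UTF-8 in your case), which uses more than one byte for code points above 127.

To get the Python 2 behavior in Python 3, first use a byte string, which is the same type as Python 2's default string. Then, because Python 3's print expects Unicode strings, use sys.stdout.buffer.write to write to the raw stdout interface, which expects byte strings.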

python3 -c 'import sys; sys.stdout.buffer.write(b"\x08\x04\x87\x18")'

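Note that writing to a file raises similar issues. To avoid any encoding translation, open the file in binary mode 'wb' and write byte strings.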
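A minimal sketch of the file case follows. The scratch path and file name are hypothetical and exist only for this demonstration.

```python
import os
import tempfile

payload = b"\x08\x04\x87\x18"

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "raw.bin")

# Binary mode writes the bytes as they are, with no encoding step.
with open(path, "wb") as f:
    f.write(payload)

# Read the file back in binary mode to inspect what was stored.
with open(path, "rb") as f:
    data = f.read()

print(data.hex())
```

Opening the same file in text mode ('w') and writing a str would go through an encoder again, which is where the extra byte would reappear.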

+1

Source: https://habr.com/ru/post/1015869/

