Print UTF-8 characters in Python 2.7

Here is how I open, read and print the file. It is a UTF-8 encoded file containing Unicode characters. I want to print the first 10 characters, but the code snippet below prints 10 strange, unrecognized characters instead. Does anyone have any ideas on how to print them correctly? Thanks.

    with open(name, 'r') as content_file:
        content = content_file.read()
        for i in range(10):
            print content[i]

Each of the 10 weird characters looks like this:

 

Regards, Lin

+4
2 answers

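UTF-8 is a variable-width encoding of Unicode characters (code points). Characters in the 7-bit ASCII range are encoded as a single byte; all other characters take two or more bytes.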
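To make the variable width concrete, here is a minimal sketch that encodes a few sample characters and counts the resulting bytes. The helper name `utf8_width` and the choice of sample characters are assumptions made for illustration; the code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# Hypothetical sample characters: "A", the copyright sign,
# the trademark sign, and an emoji outside the Basic Multilingual Plane.
samples = [u"A", u"\xa9", u"\u2122", u"\U0001F600"]


def utf8_width(char):
    """Return how many bytes the character occupies once encoded as UTF-8."""
    return len(char.encode("utf-8"))


widths = [utf8_width(c) for c in samples]
for c, w in zip(samples, widths):
    # repr() keeps the output safe on terminals that are not set to UTF-8.
    print(repr(c), w)
```

The ASCII letter takes one byte, and the other three samples take two, three and four bytes respectively.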

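So when you read the file in Python 2 without decoding it, you get a UTF-8 byte string, and indexing it gives you individual bytes, not characters. A single byte taken out of a multi-byte sequence is not valid UTF-8 on its own, so it cannot be displayed properly.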

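Here is a short demo using the characters ©, ® and ™, which are encoded as 2, 2 and 3 bytes respectively in UTF-8. My terminal is set to use UTF-8.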

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
print utfbytes, len(utfbytes)
for b in utfbytes:
    print b, repr(b)

uni = utfbytes.decode('utf-8')
print uni, len(uni)

© ® ™ 9
  '\xc2'
  '\xa9'
  ' '
  '\xc2'
  '\xae'
  ' '
  '\xe2'
  '\x84'
  '\xa2'
© ® ™ 5
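Applied to the question's code, one possible fix is to decode the file while reading it, so that indexing works on characters rather than bytes. The following is only a sketch: the function name `first_chars` and the temporary demo file are assumptions for illustration, and `io.open` is used because it accepts an `encoding` argument on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import io
import os
import tempfile


def first_chars(path, n=10):
    """Return the first n characters (not bytes) of a UTF-8 text file."""
    # io.open decodes while reading, so read(n) counts characters.
    with io.open(path, "r", encoding="utf-8") as content_file:
        return content_file.read(n)


# Demo: write the nine bytes from the example above to a throwaway file.
fd, demo_path = tempfile.mkstemp()
os.close(fd)
with open(demo_path, "wb") as f:
    f.write(b"\xc2\xa9 \xc2\xae \xe2\x84\xa2")

chars = first_chars(demo_path)
os.remove(demo_path)

for ch in chars:
    # repr() avoids depending on the terminal's encoding.
    print(repr(ch))
```

The demo file holds only five characters, so `first_chars` returns all five even though up to ten were requested.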

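Stack Overflow co-founder Joel Spolsky has written a good article on Unicode: The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)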

You should also take a look at the Unicode HOWTO in the Python docs, and at Ned Batchelder's Pragmatic Unicode article, also known as "Unipain".


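Here is a short example of extracting individual characters from a UTF-8 byte string. To slice it correctly you need to know how many bytes each character occupies; in this example the widths are simply hard-coded.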

utfbytes = "\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    print "%d %d [%s]" % (start, w, utfbytes[start:start+w])
    start += w

0 2 [©]
2 1 [ ]
3 2 [®]
5 1 [ ]
6 3 [™]

FWIW, here is a Python 3 version of that code:

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
widths = (2, 1, 2, 1, 3)
start = 0
for w in widths:
    s = utfbytes[start:start+w]
    print("%d %d [%s]" % (start, w, s.decode()))
    start += w

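If we don't know the widths of the characters in a UTF-8 string, we have to do a little more work. Each UTF-8 sequence encodes its own width in its first byte, so the width can be recovered from the byte string itself.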
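To see where the width information lives, it can help to print the bit patterns of one multi-byte sequence. This sketch uses the three bytes of the trademark sign from the demo above; the variable names are assumptions for illustration, and `bytearray` is used so the loop yields integers on both Python 2.7 and Python 3.

```python
from __future__ import print_function

# The three bytes that encode the trademark sign in UTF-8.
tm_bytes = b"\xe2\x84\xa2"

# bytearray yields integers in both Python 2 and Python 3.
bit_patterns = [format(b, "08b") for b in bytearray(tm_bytes)]
print(bit_patterns)
```

The first byte starts with `1110`, which marks a 3-byte sequence, and the two bytes that follow both start with `10`, which marks them as continuation bytes.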

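The following Python 2 demo shows how to extract that width information; it produces the same output as the hard-coded version above.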

# UTF-8 code widths
#width starting byte
#1 0xxxxxxx
#2 110xxxxx
#3 1110xxxx
#4 11110xxx
#C 10xxxxxx

def get_width(b):
    if b <= '\x7f':
        return 1
    elif '\x80' <= b <= '\xbf':
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif '\xc0' <= b <= '\xdf':
        return 2
    elif '\xe0' <= b <= '\xef':
        return 3
    elif '\xf0' <= b <= '\xf7':
        return 4
    else:
        raise ValueError('%r is not a single byte' % b)


utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"
start = 0
while start < len(utfbytes):
    b = utfbytes[start]
    w = get_width(b)
    s = utfbytes[start:start+w]
    print "%d %d [%s]" % (start, w, s)
    start += w

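In general, though, there should be no need to do this sort of thing: just use the provided decoding methods.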
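As an illustration of leaning on the built-in machinery, the standard library's incremental decoder copes with data that arrives in chunks which cut characters in half. This is a sketch under assumptions: the 2-byte chunking is hypothetical and chosen only to force such splits, and the code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function
import codecs

utfbytes = b"\xc2\xa9 \xc2\xae \xe2\x84\xa2"

# Hypothetical chunking: cut the data every 2 bytes,
# which splits some multi-byte characters across chunks.
chunks = [utfbytes[i:i + 2] for i in range(0, len(utfbytes), 2)]

decoder = codecs.getincrementaldecoder("utf-8")()

# The decoder buffers an incomplete sequence until the rest arrives.
pieces = [decoder.decode(chunk) for chunk in chunks]

# Flush at the end; this raises if a sequence is left unfinished.
pieces.append(decoder.decode(b"", final=True))

text = u"".join(pieces)
print(repr(text))
```

A chunk that ends in the middle of a character simply contributes nothing until the remaining bytes are fed in, so no manual width tracking is needed.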


For the curious, here is a Python 3 version of get_width, together with a function that decodes a UTF-8 byte string by hand.

def get_width(b):
    if b <= 0x7f:
        return 1
    elif 0x80 <= b <= 0xbf:
        #Continuation byte
        raise ValueError('Bad alignment: %r is a continuation byte' % b)
    elif 0xc0 <= b <= 0xdf:
        return 2
    elif 0xe0 <= b <= 0xef:
        return 3
    elif 0xf0 <= b <= 0xf7:
        return 4
    else:
        # 0xf8..0xff can never start a UTF-8 sequence
        raise ValueError('%r is not a valid UTF-8 lead byte' % b)

def decode_utf8(utfbytes):
    start = 0
    uni = []
    while start < len(utfbytes):
        b = utfbytes[start]
        w = get_width(b)
        if w == 1:
            n = b
        else:
            n = b & (0x7f >> w)
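            tail = utfbytes[start+1:start+w]
            if len(tail) != w - 1:
                # The data ends before the sequence is complete
                raise ValueError('Truncated sequence at offset %d' % start)
            for b in tail:
                if not 0x80 <= b <= 0xbf:
                    raise ValueError('Not a continuation byte: %r' % b)
                n <<= 6
                n |= b & 0x3f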
        uni.append(chr(n))
        start += w
    return ''.join(uni)


utfbytes = b'\xc2\xa9 \xc2\xae \xe2\x84\xa2'
print(utfbytes.decode('utf8'))
print(decode_utf8(utfbytes))

© ® ™
© ® ™

+10

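To output a Unicode string, you need to choose a text encoding. In Python 2 the default encoding is ASCII, so to support other characters you have to encode explicitly, for example to UTF-8: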

s = unicode(your_object).encode('utf8')
print s
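As a small check of what that `encode` call produces, the sketch below encodes a unicode string and compares lengths before and after. The sample text is an assumption for illustration, and the code is written to run on both Python 2.7 and Python 3.

```python
from __future__ import print_function

text = u"\xa9 \xae \u2122"        # unicode text: 5 characters
encoded = text.encode("utf-8")    # byte string: 9 bytes

print(len(text), len(encoded))

# Decoding the bytes gives back the original text.
round_trip = encoded.decode("utf-8")
print(round_trip == text)
```

The character count and the byte count differ because each of the three symbols takes more than one byte in UTF-8.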
+3

Source: https://habr.com/ru/post/1649496/

