Python: UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

I am trying to write a script that generates random Unicode by creating random UTF-8 encoded byte strings and then decoding them to unicode. It works fine for a single byte, but with two bytes it fails.

For example, if I run the following in a Python shell:

>>> a = str()
>>> a += chr(0xc0) + chr(0xaf)
>>> print a.decode('utf-8')
UnicodeDecodeError: 'utf8' codec can't decode byte 0xc0 in position 0: invalid start byte

According to the UTF-8 scheme (https://en.wikipedia.org/wiki/UTF-8#Description), the byte sequence 0xc0 0xaf should be valid, since 0xc0 starts with 110 and 0xaf starts with 10.


Here is my python script:

def unicode(self):
    '''returns a random (astral) utf encoded byte string'''
    num_bytes = random.randint(1,4)
    if num_bytes == 1:
        return self.gen_utf8(num_bytes, 0x00, 0x7F)
    elif num_bytes == 2:
        return self.gen_utf8(num_bytes, 0xC0, 0xDF)
    elif num_bytes == 3:
        return self.gen_utf8(num_bytes, 0xE0, 0xEF)
    elif num_bytes == 4:
        return self.gen_utf8(num_bytes, 0xF0, 0xF7)

def gen_utf8(self, num_bytes, start_val, end_val):
    byte_str = list()
    byte_str.append(random.randrange(start_val, end_val)) # start byte
    for i in range(0,num_bytes-1):
        byte_str.append(random.randrange(0x80,0xBF)) # trailing bytes
    a = str()
    sum = int()
    for b in byte_str:
        a += chr(b) 
    ret = a.decode('utf-8')
    return ret

if __name__ == "__main__":
    g = GenFuzz()
    print g.gen_utf8(2,0xC0,0xDF)

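You have generated an overlong encoding, which is not valid UTF-8. Two-byte UTF-8 sequences encode the code points U+0080 through U+07FF. A sequence that starts with 0xc0 would decode to a code point below U+0080, and those must be encoded as a single byte. So 0xc0 can never appear in valid UTF-8, and the same is true of 0xc1.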
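To make the overlong case concrete, here is a small sketch in Python 3 (the question itself uses Python 2). The helper names `decode_two_byte_payload` and `is_valid_utf8` are made up for illustration. It extracts the payload bits of a two-byte sequence by hand, ignoring validity, and then asks Python's strict decoder for its opinion.

```python
def decode_two_byte_payload(lead, trail):
    """Code point carried by a 110xxxxx 10xxxxxx pair, with no validity checks."""
    return ((lead & 0x1F) << 6) | (trail & 0x3F)


def is_valid_utf8(data):
    """True if Python's strict UTF-8 decoder accepts the byte string."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# 0xC0 0xAF carries U+002F ('/'), which already fits in one byte.
cp = decode_two_byte_payload(0xC0, 0xAF)
print(hex(cp), repr(chr(cp)))  # 0x2f '/'

# The largest payload reachable with lead byte 0xC1 is still below U+0080,
# while 0xC2 0x80 is the first pair that really needs two bytes.
max_c1 = decode_two_byte_payload(0xC1, 0xBF)
min_c2 = decode_two_byte_payload(0xC2, 0x80)
print(hex(max_c1), hex(min_c2))  # 0x7f 0x80

print(is_valid_utf8(bytes([0xC0, 0xAF])))  # False
print(is_valid_utf8(bytes([0xC2, 0x80])))  # True
```

In other words, the first byte of a two-byte sequence has to lie in 0xC2..0xDF, not 0xC0..0xDF.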

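Some UTF-8 decoders have mistakenly accepted sequences such as C0 AF as valid UTF-8, which has led to security vulnerabilities.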
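One way to avoid producing such sequences at all is to randomize the code point instead of the raw bytes and let the encoder produce the bytes. The following is a minimal Python 3 sketch under that assumption, not a drop-in replacement for the script in the question. The names `RANGES` and `random_utf8` are hypothetical. It keeps the idea of choosing a length from 1 to 4 and skips the surrogate range U+D800..U+DFFF, which UTF-8 cannot encode.

```python
import random

# Inclusive code point range that needs exactly n bytes in UTF-8.
RANGES = {
    1: (0x0000, 0x007F),
    2: (0x0080, 0x07FF),
    3: (0x0800, 0xFFFF),
    4: (0x10000, 0x10FFFF),
}


def random_utf8(num_bytes, rng=random):
    """Return the UTF-8 bytes of a random code point that encodes to num_bytes bytes."""
    lo, hi = RANGES[num_bytes]
    while True:
        cp = rng.randint(lo, hi)  # randint includes both endpoints
        # Surrogates are not valid scalar values; draw again.
        if not 0xD800 <= cp <= 0xDFFF:
            return chr(cp).encode("utf-8")


sample = random_utf8(random.randint(1, 4))
# ascii() keeps the output printable on any terminal encoding.
print(len(sample), ascii(sample.decode("utf-8")))
```

Because the bytes come from the encoder, this approach cannot emit 0xc0 or 0xc1 as a first byte, and it also never emits 0xf5 through 0xf7, which would correspond to code points above U+10FFFF.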


For the same error with 0xc0, this worked: encoding="ISO-8859-1"
fooobar.com/questions/1541362/...
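A note on why ISO-8859-1 makes the error disappear: that encoding maps every byte value from 0x00 to 0xFF to the code point with the same number, so decoding cannot fail. The short Python 3 sketch below illustrates this (variable names are illustrative only). Note that it reinterprets the bytes as Latin-1 characters and does not turn them into valid UTF-8.

```python
raw = bytes([0xC0, 0xAF])

# ISO-8859-1 maps each byte to the code point of the same value.
text = raw.decode("iso-8859-1")
print([hex(ord(c)) for c in text])  # ['0xc0', '0xaf']

# Every possible byte value decodes without error.
all_text = bytes(range(256)).decode("iso-8859-1")
print(len(all_text))  # 256

# The same bytes are still rejected by the UTF-8 decoder.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)  # False
```

So this setting is useful when the data really is Latin-1, but for generating random Unicode it only yields characters in U+0000..U+00FF.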

, unicode, , , , , python , utf-8 ascii .

More on ISO-8859-1: What is the difference between UTF-8 and ISO-8859-1?


Source: https://habr.com/ru/post/1541360/

