I have the following function to parse a UTF-8 string from a byte sequence.
Note: "length_size" is the number of bytes used to represent the length of the UTF-8 string.
    import struct

    def parse_utf8(self, bytes, length_size):
        # The first length_size bytes hold the payload length in bytes
        length = bytes2int(bytes[0:length_size])
        value = ''.join(['%c' % b for b in bytes[length_size:length_size+length]])
        return value

    def bytes2int(raw_bytes, signed=False):
        """
        Convert a string of bytes to an integer (assumes little-endian byte order)
        """
        if len(raw_bytes) == 0:
            return None
        fmt = {1:'B', 2:'H', 4:'I', 8:'Q'}[len(raw_bytes)]
        if signed:
            fmt = fmt.lower()
        return struct.unpack('<'+fmt, raw_bytes)[0]
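To make the layout concrete, here is a minimal sketch of one record as parse_utf8 appears to expect it: a little-endian length prefix followed by the UTF-8 payload. The sample text and the 2-byte prefix are assumptions chosen for illustration. The prefix counts bytes, not characters, which matters as soon as the text contains non-ASCII characters.

```python
import struct

# Hypothetical sample text; the e-acute takes two bytes in UTF-8.
payload = u"h\u00e9llo".encode("utf-8")

# Assumed layout: 2-byte little-endian length prefix, then the payload.
record = struct.pack("<H", len(payload)) + payload

# Reading it back: unpack the prefix, then decode exactly that many bytes.
length = struct.unpack("<H", record[0:2])[0]
text = record[2:2 + length].decode("utf-8")
```

Here the text has 5 characters but the prefix stores 6, because the length is measured after encoding.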
I would like to write the inverse function, that is, a function that takes a UTF-8 encoded string and returns its representation as a byte string.
So far I have the following:
    def create_utf8(self, utf8_string):
        return utf8_string.encode('utf-8')
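For comparison, a writer that mirrors the structure of parse_utf8 might look like the sketch below. It assumes the input is a text (unicode) object rather than already-encoded bytes, and that the output should carry the same little-endian length prefix that parse_utf8 reads. The helper name int2bytes and the length_size parameter are hypothetical and not part of the code above.

```python
import struct

# Same size-to-format mapping that bytes2int uses (unsigned only).
_FORMATS = {1: 'B', 2: 'H', 4: 'I', 8: 'Q'}

def int2bytes(value, size):
    """Pack a non-negative integer into `size` little-endian bytes.

    Hypothetical inverse of bytes2int.
    """
    return struct.pack('<' + _FORMATS[size], value)

def create_utf8(text, length_size):
    """Encode text as UTF-8 and prepend its length in bytes."""
    encoded = text.encode('utf-8')
    return int2bytes(len(encoded), length_size) + encoded
```

The length is taken from the encoded bytes, not from the text, so that a reader slicing by byte count recovers the whole payload.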
When testing this function, I get the following error:
      File "writer.py", line 229, in create_utf8
        return utf8_string.encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0x98 in position 0: ordinal not in range(128)
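A UnicodeDecodeError coming out of an encode call is the behaviour Python 2 shows when encode is called on a byte string (str) rather than a unicode object: the bytes are first decoded implicitly with the ASCII codec, and that step fails on any byte above 0x7F. The sketch below makes that hidden step explicit. The sample bytes are an assumption, chosen only to match the 0x98 in the traceback.

```python
# Hypothetical byte string whose first byte is outside the ASCII range.
raw = b"\x98abc"

try:
    # This is the step Python 2 performs implicitly before encoding a str.
    raw.decode("ascii")
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)
```

If that is what happens here, the value passed to create_utf8 is already a byte string, and the fix belongs at the point where the string is created rather than in create_utf8 itself.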
If possible, I would like the code to follow a structure similar to the parse_utf8 example. What am I doing wrong?
Thank you for your help!
UPDATE: test driver, now correct
    def random_utf8_seq(self, length):
        test_charset = u" !\"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~ ¡¢£¤¥¦§¨©ª«¬ ®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿĂăĄąĆćČčĎďĐđĘęĚěĹĺĽľŁłŃńŇňŐőŒœŔŕŘřŚśŞşŠšŢţŤťŮůŰűŸŹźŻżŽžƒˆˇ˘˙˛˜˝–—‘’‚“”„†‡•…‰‹›€™"
        utf8_seq = u""
        for i in range(length):
            utf8_seq += random.choice(test_charset)
        return utf8_seq
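A round-trip check built on such a generator could be sketched as follows. The shortened charset and the function name random_text are assumptions for illustration. The point is that the random text stays a unicode object until it is explicitly encoded, and decoding the result should give back the original text.

```python
import random

def random_text(length, charset=u"abc\u00e9\u0142\u20ac"):
    """Build a random text string from a small, hypothetical charset."""
    return u"".join(random.choice(charset) for _ in range(length))

text = random_text(200)

# Encode once, at the boundary where bytes are needed ...
encoded = text.encode("utf-8")

# ... and decode once when reading them back.
decoded = encoded.decode("utf-8")
```

Because some characters in the charset need two or three bytes, the encoded form is usually longer than 200 bytes even though the text has exactly 200 characters.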
I get the following error:
        input_str = self.random_utf8_seq(200)
      File "writer.py", line 226, in random_utf8_seq
        print unicode(utf8_seq, "utf-8")
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xbb in position 0: invalid start byte
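One possible cause of an "invalid start byte" error like this is that the bytes being decoded are in a single-byte encoding rather than UTF-8. That is an assumption about this case, not something the traceback proves. For example, in Latin-1 the character » is the single byte 0xBB, which in UTF-8 can only appear as a continuation byte:

```python
# The character » encoded in a single-byte encoding (Latin-1) is just 0xBB.
raw = u"\u00bb".encode("latin-1")

try:
    # 0xBB cannot start a UTF-8 sequence, so this decode fails.
    raw.decode("utf-8")
    failed = False
    message = ""
except UnicodeDecodeError as exc:
    failed = True
    message = str(exc)

# Decoding with the encoding the bytes were actually written in succeeds.
text = raw.decode("latin-1")
```

In Python 2 this can happen when non-ASCII characters end up in a plain byte string, for instance when the source file is saved in a single-byte encoding.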