How to decode an ASCII string containing \x backslash escape codes

I am trying to decode a piece of Brazilian Portuguese text:

'Demais Subfun\xc3\xa7\xc3\xb5es 12'

It should be

'Demais Subfunções 12'

 >>> a.decode('unicode_escape')
 >>> a.encode('unicode_escape')
 >>> a.decode('ascii')
 >>> a.encode('ascii')

all give:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128) 

On the other hand, printing gives:

 >>> print a.encode('utf-8')
 Demais Subfun├â┬º├â┬Áes 12
 >>> print a
 Demais Subfunções 12
1 answer

You have binary data that is not ASCII-encoded. The \xhh escape codes indicate that your data is encoded with a different codec; what you see is the representation Python produces with the repr() function, which can be reused as a Python literal to recreate exactly the same value. This representation is very useful when debugging a program.

In other words, each \xhh escape sequence represents a single byte, and hh is the hexadecimal value of that byte. You have 4 bytes with the hexadecimal values C3, A7, C3, and B5 that do not map to printable ASCII characters, so Python uses the \xhh notation instead.
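To make the byte-level view concrete, here is a minimal sketch. It assumes Python 3, where the same literal has to be written with a `b` prefix (in Python 2, as in the question, a plain `str` already holds bytes). The variable names are only illustrative.

```python
import ast

# Python 3 sketch: the question's value written as a bytes literal.
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Each \xhh escape stands for exactly one byte; list the non-ASCII ones.
non_ascii = [hex(b) for b in data if b >= 0x80]
print(non_ascii)  # ['0xc3', '0xa7', '0xc3', '0xb5']

# repr() yields a literal that recreates exactly the same value.
literal = repr(data)
print(literal)
print(ast.literal_eval(literal) == data)  # True
```

`ast.literal_eval` is used here only to show that the repr() output round-trips; it parses literals without executing arbitrary code.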

What you have is UTF-8 data, so decode it as such:

 >>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
 u'Demais Subfun\xe7\xf5es 12'
 >>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
 Demais Subfunções 12

The bytes C3 A7 together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, and the bytes C3 B5 encode U+00F5 LATIN SMALL LETTER O WITH TILDE.
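The mapping from byte pairs to code points can be checked with the standard `unicodedata` module. This is a sketch assuming Python 3 (hence the `b` prefix and `print()` as a function); the names `data`, `text`, and `decoded` are illustrative.

```python
import unicodedata

# Python 3 sketch: decode the UTF-8 bytes into text.
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'
text = data.decode('utf-8')
print(text)  # Demais Subfunções 12

# Decode each two-byte sequence on its own and look up the character name.
decoded = []
for pair in (b'\xc3\xa7', b'\xc3\xb5'):
    ch = pair.decode('utf-8')
    decoded.append('U+%04X %s' % (ord(ch), unicodedata.name(ch)))

print(decoded[0])  # U+00E7 LATIN SMALL LETTER C WITH CEDILLA
print(decoded[1])  # U+00F5 LATIN SMALL LETTER O WITH TILDE
```

The decoded text is four bytes shorter than the input only in byte count terms: 22 bytes become 20 characters, because each two-byte sequence collapses into one character.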

ASCII is a subset of UTF-8, so all the other letters can be shown as they are in the Python repr() output.
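The subset relationship can be illustrated with a short check, again assuming Python 3. The name `ascii_part` is made up for this sketch.

```python
# Python 3 sketch: bytes below 0x80 are plain ASCII and are valid UTF-8 as-is.
data = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Keep only the bytes that fall inside the ASCII range.
ascii_part = bytes(b for b in data if b < 0x80)
print(ascii_part)  # b'Demais Subfunes 12'

# Decoding those bytes as ASCII or as UTF-8 gives the same text.
print(ascii_part.decode('ascii') == ascii_part.decode('utf-8'))  # True
```

Only the four bytes at or above 0x80 need the \xhh notation; everything else is displayed directly.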


Source: https://habr.com/ru/post/1232188/

