How to decode an ASCII string with backslash \x codes

Question

I am trying to decode a Brazilian Portuguese text:

'Demais Subfun\xc3\xa7\xc3\xb5es 12'

It should be

'Demais Subfunções 12'

 >>> a.decode('unicode_escape')
 >>> a.encode('unicode_escape')
 >>> a.decode('ascii')
 >>> a.encode('ascii')

All of these give:

 UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 13: ordinal not in range(128)

On the other hand, printing gives:

 >>> print a.encode('utf-8')
 Demais Subfun├â┬º├â┬Áes 12
 >>> print a
 Demais SubfunÃ§Ãµes 12

python string unicode python-2.x

Davoud Taghawi-Nejad, 24 Sep '15 at 12:57

1 answer

Martijn Pieters · Accepted Answer · 2015-09-24T12:59:08+0000

You have binary data that is not ASCII encoded. The \xhh escapes indicate that your data is encoded with a different codec. What you are seeing is the representation Python produces with the repr() function, which can be reused as a Python literal to recreate exactly the same value. This representation is very useful when debugging a program.

In other words, each \xhh escape sequence represents a single byte, where hh is the hexadecimal value of that byte. You have 4 bytes with the hexadecimal values C3, A7, C3, and B5, none of which map to printable ASCII characters, so Python uses the \xhh notation instead.
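To make the byte-level view concrete, here is a minimal sketch. It assumes Python 3, where the byte string needs a `b` prefix (in the Python 2 session from the question, the plain `'...'` literal is already a byte string). The variable names are only illustrative.

```python
# Assumption: Python 3. The b prefix marks a byte string.
raw = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Each \xhh escape in the literal stands for exactly one byte.
# Iterating over bytes in Python 3 yields integers.
non_ascii = [byte for byte in raw if byte > 0x7F]
hex_values = ['%02X' % byte for byte in non_ascii]

print(len(raw))     # 22 (13 ASCII bytes, 4 non-ASCII bytes, 5 ASCII bytes)
print(hex_values)   # ['C3', 'A7', 'C3', 'B5']
print(repr(raw))    # the same literal again, with \xhh for the non-ASCII bytes
```

The last line shows the point made above: repr() falls back to \xhh only for bytes that have no printable ASCII form.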

What you have is UTF-8 data, so decode it as such:

 >>> 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
 u'Demais Subfun\xe7\xf5es 12'
 >>> print 'Demais Subfun\xc3\xa7\xc3\xb5es 12'.decode('utf8')
 Demais Subfunções 12

The bytes C3 A7 together encode U+00E7 LATIN SMALL LETTER C WITH CEDILLA, and the bytes C3 B5 encode U+00F5 LATIN SMALL LETTER O WITH TILDE.
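The mapping from byte pairs to code points can be checked with the standard unicodedata module. This is a small sketch, again assuming Python 3; `pairs` and `decoded` are names chosen for the example.

```python
import unicodedata

# Assumption: Python 3. Each two-byte UTF-8 sequence decodes to one character.
pairs = [b'\xc3\xa7', b'\xc3\xb5']
decoded = [pair.decode('utf-8') for pair in pairs]

for pair, char in zip(pairs, decoded):
    # Print the raw bytes, the code point, and the official Unicode name.
    print(pair, 'U+%04X' % ord(char), unicodedata.name(char))
```

Running it lists U+00E7 and U+00F5 with their Unicode names, matching the two characters that were missing from the printed text.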

ASCII is a subset of UTF-8, which is why all the other characters can be shown as themselves in the Python repr() output.
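The subset relationship, and the reason the ASCII codec fails on this data, can be sketched as follows. This assumes Python 3 rather than the Python 2 used in the question, and the variable names are illustrative.

```python
# Assumption: Python 3 byte string holding the data from the question.
raw = b'Demais Subfun\xc3\xa7\xc3\xb5es 12'

# Bytes below 0x80 mean the same thing in ASCII and in UTF-8.
ascii_part = raw[:13]  # b'Demais Subfun'
same_result = ascii_part.decode('ascii') == ascii_part.decode('utf-8')

# The full value contains bytes >= 0x80, so only the UTF-8 codec can decode it.
text = raw.decode('utf-8')

try:
    raw.decode('ascii')
    ascii_failed = False
except UnicodeDecodeError as exc:
    ascii_failed = True
    failed_at = exc.start  # index of the first byte the ASCII codec rejects

print(same_result)   # True
print(text)          # Demais Subfunções 12
print(ascii_failed)  # True
```

The failing index is 13, the position of the first C3 byte, which matches the position reported in the error message from the question.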
