What to do if a Python unicode object was decoded incorrectly

I have a unicode variable, unicodeVar. For example: u'\xea\xb1\xb8\xec\x8a\xa4\xeb\x8d\xb0\xec\x9d\xb4 \xeb\xaf\xb8\xeb\x8b\x88\xec\x95\xa8\xeb\xb2\x94 3\xec\xa7\x91'.

This is what I see when I type unicodeVar in the console. The text is actually 걸스 데이 미니 앨범 3 집. Yes, this is Korean. Apparently the variable was decoded into unicode incorrectly, and I never get the Korean text when I use unicodeVar in my program. The Korean text shown above is the result of:

'\xea\xb1\xb8\xec\x8a\xa4\xeb\x8d\xb0\xec\x9d\xb4 \xeb\xaf\xb8\xeb\x8b\x88\xec\x95\xa8\xeb\xb2\x94 3\xec\xa7\x91'.decode('utf-8')

If I call unicodeVar.decode('unicode-escape'), the result is a string with doubled backslashes, like '\\xea\\xb1\\xb8\\xec\\x8a\\xa4\\xeb\\x8d\\xb0\\xec\\x9d\\xb4 \\xeb\\xaf\\xb8\\xeb\\x8b\\x88\\xec\\x95\\xa8\\xeb\\xb2\\x94 3\\xec\\xa7\\x91'.

The question is: how can I get the correct representation from the variable, using only unicodeVar?


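You need to encode the string with the latin1 encoding, then decode the resulting bytes with the correct encoding (utf-8 in this case):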

>>> s = u'\xea\xb1\xb8\xec\x8a\xa4\xeb\x8d\xb0\xec\x9d\xb4\xeb\xaf\xb8\xeb\x8b\x88\xec\x95\xa8\xeb\xb2\x94 3\xec\xa7\x91'
>>> print(s.encode('latin1').decode('utf-8'))
걸스데이미니앨범 3집

Why does this work?

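The text was (incorrectly) decoded from utf-8 bytes as latin1, so encoding it as latin1 gives back the original utf-8 bytes, which can then be decoded as utf-8. The broken value can be reproduced like this: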

>>> utf_8_bytes = u'걸스데이미니앨범 3집'.encode('utf-8')
>>> utf_8_bytes.decode('latin1')
u'\xea\xb1\xb8\xec\x8a\xa4\xeb\x8d\xb0\xec\x9d\xb4\xeb\xaf\xb8\xeb\x8b\x88\xec\x95\xa8\xeb\xb2\x94 3\xec\xa7\x91'
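
The round trip described above can be wrapped in a small helper. This is only a sketch: the function name repair_latin1_mojibake is made up for illustration, and it assumes the damage really was utf-8 bytes decoded as latin1. If the text cannot be encoded as latin1, or the resulting bytes are not valid utf-8, it returns the input unchanged.

```python
# -*- coding: utf-8 -*-

def repair_latin1_mojibake(text):
    """Undo a utf-8 -> latin1 mis-decoding (hypothetical helper).

    Returns the text unchanged when the round trip does not apply,
    e.g. when the text is already correct Korean.
    """
    try:
        # latin1 maps code points 0-255 one-to-one onto bytes 0-255,
        # so this recovers the original utf-8 byte sequence.
        return text.encode('latin1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


broken = u'\xea\xb1\xb8\xec\x8a\xa4\xeb\x8d\xb0\xec\x9d\xb4 \xeb\xaf\xb8\xeb\x8b\x88\xec\x95\xa8\xeb\xb2\x94 3\xec\xa7\x91'
fixed = repair_latin1_mojibake(broken)

# Compare instead of printing, so the check does not depend on the
# terminal's encoding.
print(fixed == u'걸스데이 미니앨범 3집')
# Applying the helper to already-correct text leaves it as it is.
print(repair_latin1_mojibake(fixed) == fixed)
```

Catching both exception types keeps the helper safe to call on strings that were never broken, but it cannot tell apart genuine latin1 text that happens to form valid utf-8, so it is best used only on data known to be affected.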

Drop the u prefix: without u the literal is a byte string, which you can decode as utf-8 to get the correct unicode:

>>> print '\xea\xb1\xb8\xec\x8a\xa4\xeb\x8d\xb0\xec\x9d\xb4 \xeb\xaf\xb8\xeb\x8b\x88\xec\x95\xa8\xeb\xb2\x94 3\xec\xa7\x91'.decode('utf-8')
걸스데이 미니앨범 3집
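
The snippet above is Python 2, where an unprefixed literal is a byte string. In Python 3 an unprefixed literal is already unicode, so the same idea needs an explicit b prefix. A minimal sketch, with the variable names raw and text chosen only for illustration:

```python
# -*- coding: utf-8 -*-

# The b prefix makes this a byte string in Python 3 (and Python 2.6+).
raw = b'\xea\xb1\xb8\xec\x8a\xa4\xeb\x8d\xb0\xec\x9d\xb4 \xeb\xaf\xb8\xeb\x8b\x88\xec\x95\xa8\xeb\xb2\x94 3\xec\xa7\x91'

# Decoding the bytes as utf-8 yields the intended Korean text.
text = raw.decode('utf-8')

# Each Korean syllable takes 3 bytes in utf-8, so the byte count is
# larger than the character count.
print(len(raw), len(text))
```

Under Python 3 this prints 30 12: nine Korean syllables at three bytes each plus two spaces and the digit, versus twelve characters after decoding.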

Source: https://habr.com/ru/post/1526355/

