How to fix corrupted utf-8 encoding in Python?

Question

My string is Niá»‡m Bá»' TÃ¡t (Thiá»n sÆ° Nháº¥t Háº¡nh) , and I want to decode it into Niệm Bồ Tát (Thiền sư Nhất Hạnh) . I see that on this site you can do this http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx

So I tried the following in Python:

 mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
 mystr.decode('utf-8')

but this does not work: the original text is UTF-8, yet the string that is displayed is not the result I expect.

Note: the text is Vietnamese.

How can I fix this? Is it a Windows encoding or something similar? How can I determine which encoding is involved here?

+8

python unicode utf-8 character-encoding

giaosudau Oct 21 '14 at 16:17

3 answers

The only thing that helped me with broken Cyrillic text was ftfy: https://github.com/LuminosoInsight/python-ftfy

This module fixes almost everything and works much better than online decoders.

 >>> from ftfy import fix_encoding
 >>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
 >>> fix_encoding(mystr)
 '09. Bát Nhã Tâm Kinh'

It can be installed with pip install ftfy.

+12

Dima Rostopira Oct 6 '16 at 19:42

Try:

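mystr.encode('ascii', 'ignore').decode('utf-8')

This encodes the string as ASCII, ignoring any characters that cannot be encoded, and then decodes the result as UTF-8. Note that it does not repair the text: every non-ASCII character, including the accented letters, is dropped. Still, it is one approach.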
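To make the effect of this approach concrete, here is a minimal Python 3 sketch applied to the example string from the question. The variable names are only illustrative, and the garbled string is rebuilt from the intended text on the assumption that the damage came from UTF-8 bytes being decoded as Latin-1.

```python
# Python 3 sketch: what encode('ascii', 'ignore') does to a mojibake string.
# Assumption: the garbling came from UTF-8 bytes wrongly decoded as Latin-1.
intended = "09. Bát Nhã Tâm Kinh"
garbled = intended.encode("utf-8").decode("latin-1")  # '09. BÃ¡t NhÃ£ TÃ¢m Kinh'

# Every non-ASCII character is silently dropped; nothing is repaired.
stripped = garbled.encode("ascii", "ignore").decode("utf-8")
print(stripped)  # 09. Bt Nh Tm Kinh
```

The accented vowels disappear entirely, so this is only suitable when losing those characters is acceptable.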

0

Walter Oct 15 '19 at 2:34

Jonathan ballet · Accepted Answer · 2014-10-21T17:27:17+0000

I'm not sure what you can do with this data, but for your example in your original post, this works:

 >>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
 >>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
 >>> s
 u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
 >>> print(s)
 09. Bát Nhã Tâm Kinh
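The session above is Python 2, where the first decode('utf8') turns a byte string into unicode. In Python 3 a str is already Unicode and has no decode method, so only the encode/decode round trip remains. The following is a rough sketch of that idea, not a definitive fix: fix_mojibake and its candidates parameter are hypothetical names, and trying cp1252 after latin-1 is an assumption based on the "‡" in the first string of the question, which exists in Windows-1252 but not in Latin-1.

```python
def fix_mojibake(text, candidates=("latin-1", "cp1252")):
    """Undo UTF-8 bytes that were wrongly decoded with a single-byte codec.

    Hypothetical helper for illustration. Tries each candidate codec in turn
    and returns the text unchanged if none of them round-trips cleanly.
    """
    for codec in candidates:
        try:
            # Recover the original bytes, then decode them as UTF-8.
            return text.encode(codec).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue
    return text


# Rebuild the garbled strings from the intended text, so the mojibake is
# exactly what a wrong decode would have produced.
garbled_latin1 = "09. Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")
garbled_cp1252 = "Niệm".encode("utf-8").decode("cp1252")

print(fix_mojibake(garbled_latin1))  # 09. Bát Nhã Tâm Kinh
print(fix_mojibake(garbled_cp1252))  # Niệm
```

Text that is already correct fails the round trip and is returned as-is. If some of the original bytes fell on positions that Windows-1252 leaves undefined, those characters were lost when the text was garbled, and a round trip like this may not recover them.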
