How to fix corrupted utf-8 encoding in Python?

My string is Niệm Bá»' Tát (Thiá»n sư Nhất Hạnh), and I want to decode it into Niệm Bồ Tát (Thiền sư Nhất Hạnh). This site can do the conversion: http://www.enderminh.com/minh/utf8-to-unicode-converter.aspx

I tried the following in Python:

 mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
 mystr.decode('utf-8')

but this does not work: the original text is UTF-8, yet the output is not the result I expect.

Note: the text is Vietnamese.

How can I solve this? Is it Windows Unicode or something? How do I determine the encoding here?

+8
3 answers

I'm not sure what you can do with this data, but for your example in your original post, this works:

 >>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
 >>> s = mystr.decode('utf8').encode('latin1').decode('utf8')
 >>> s
 u'09. B\xe1t Nh\xe3 T\xe2m Kinh'
 >>> print(s)
 09. Bát Nhã Tâm Kinh
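The session above is Python 2, where a plain string is bytes. As a minimal sketch of the same round trip in Python 3 (the helper name `fix_mojibake` and its `wrong_encoding` parameter are made up for illustration, and it assumes the text is UTF-8 that was mistakenly decoded with a single-byte encoding):

```python
def fix_mojibake(text: str, wrong_encoding: str = "latin-1") -> str:
    """Undo UTF-8 bytes that were mistakenly decoded as `wrong_encoding`.

    Hypothetical helper: returns the input unchanged if the round trip fails.
    """
    try:
        # Recover the original bytes, then decode them as the UTF-8 they were.
        return text.encode(wrong_encoding).decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not this kind of mojibake (or already correct): leave it alone.
        return text


# Reproduce the corruption: UTF-8 bytes read as Latin-1.
broken = "09. Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")
fixed = fix_mojibake(broken)
print(fixed)
```

If `latin-1` does not repair your data, `cp1252` (the Windows Western European code page) is another candidate worth trying as `wrong_encoding`.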
+10

The only thing that helped me with garbled Cyrillic text is https://github.com/LuminosoInsight/python-ftfy

This module fixes almost everything and works much better than online decoders.

 >>> from ftfy import fix_encoding
 >>> mystr = '09. BÃ¡t NhÃ£ TÃ¢m Kinh'
 >>> fix_encoding(mystr)
 '09. Bát Nhã Tâm Kinh'

It can be easily installed using pip install ftfy

+12

Try:

str.encode('ascii', 'ignore').decode('utf-8')

You encode the string as ASCII, ignoring errors, and then decode it as UTF-8. This drops every non-ASCII character, so the accents are lost rather than repaired, but it is one approach.
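To make the trade-off concrete, here is a small Python 3 sketch of that approach applied to the string from the question. The variable names are only illustrative, and the corrupted input is rebuilt by assuming the UTF-8 bytes were read as Latin-1:

```python
# Reproduce the corrupted input: UTF-8 bytes decoded as Latin-1.
broken = "09. Bát Nhã Tâm Kinh".encode("utf-8").decode("latin-1")

# Encode to ASCII while silently dropping anything that is not ASCII,
# then decode the remaining bytes (pure ASCII is also valid UTF-8).
stripped = broken.encode("ascii", "ignore").decode("utf-8")

print(stripped)
```

The result is readable ASCII, but the accented letters are gone entirely, so this only suits cases where losing them is acceptable.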

0

Source: https://habr.com/ru/post/1205201/

