Detecting incorrect character encodings using Python

I am new to serious programming. While writing a Python program, I came across lines like these when reading from a file:

Îêåàí Åëüçè - Ìàéæå âåñíà
Êîððîçèÿ ìåòàëëà - Âîéíà ìèðîâ

These should actually be Cyrillic (cp1251), so the lines are victims of an incorrect encoding conversion. (I found that out after a long search, using the site Universal Cyrillic decoder.)

The detect function in the chardet module can also find the encoding:

chardet.detect('Îêåàí Åëüçè - Ìàéæå âåñíà'.decode('utf-8').encode('windows-1252')) 

which gives:
{'confidence': 0.7679697235616183, 'encoding': 'windows-1251'}

After doing the following, I can recover the intended string:

 string.decode('utf-8').encode('windows-1252').decode('windows-1251').encode('utf-8') 

which gives:

Океан Ельзи - Майже весна (Okean Elzy - Almost Spring)
Коррозия металла - Война миров (Metal Corrosion - War of the Worlds)

respectively for the above lines.
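
The snippets above are Python 2, where a plain string literal is a byte string. In Python 3, str is already Unicode, so the outer decode('utf-8') and encode('utf-8') steps disappear and the repair reduces to one encode/decode pair. A minimal sketch, assuming the text really is cp1251 that was misread as cp1252 (the function name fix_cp1251_mojibake is hypothetical):

```python
def fix_cp1251_mojibake(text: str) -> str:
    """Undo 'cp1251 bytes read as cp1252' for an already-decoded string."""
    # Re-encoding with the codec that was wrongly used to read the data
    # recovers the original bytes; decoding them as cp1251 gives the text.
    return text.encode("cp1252").decode("cp1251")


repaired = fix_cp1251_mojibake("Îêåàí Åëüçè - Ìàéæå âåñíà")
# repaired == "Океан Ельзи - Майже весна"
```

If the input is not this kind of mojibake, either step may raise UnicodeEncodeError or UnicodeDecodeError, which is itself a useful signal that the assumption was wrong.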

My question is: is there a way to detect such strings automatically? Here are some other lines that I could not find any way to fix:

Isao Sasaki - ¨¬¡Æ¨¬¡ÆAI ¨ ¬ Æ (Another farewell) (¡¾ ¢ ¬ ¬¬¬¬¬ ¾ ¾ ¾ ¾
) Yoon K. Lee and Salzburg Kammerp - ³ "¸¶À½
⁂ 晉 䤠 圠 牥 ⁥⁡ 潂 ⁹ 䬨 牡 慭 牴 湯 捩 删 浥 硩 䴠 楡 ⥮
à ÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃ

Thanks so much for your answers.

1 answer

Well, that Cyrillic line is not in cp1251. As you seem to have found out, it was encoded "twice": most likely someone took a byte string in cp1251, wrongly assumed it was cp1252, and re-encoded it as UTF-8, or something like that.
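
To make the "encoded twice" idea concrete, here is a minimal Python 3 sketch of how such a line could have been produced. The exact chain is an assumption inferred from the decoding that happens to work, and the variable names are illustrative only:

```python
original = "Океан Ельзи - Майже весна"

raw = original.encode("cp1251")    # bytes as they were first written
misread = raw.decode("cp1252")     # wrongly interpreted as Western European text
stored = misread.encode("utf-8")   # then saved again, this time as UTF-8

# misread == "Îêåàí Åëüçè - Ìàéæå âåñíà"
# Reversing each step in the opposite order restores the original text.
restored = stored.decode("utf-8").encode("cp1252").decode("cp1251")
```

The file on disk is perfectly valid UTF-8 at the end of this chain, which is why a plain validity check does not notice anything wrong.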

No automatic check can figure that out.

 >>> print 'Îêåàí Åëüçè - Ìàéæå âåñíà'.decode('utf8').encode('latin1').decode('cp1251')
 Океан Ельзи - Майже весна

works. The other lines look like UTF-8 at first glance, because they mix single-byte and multibyte characters, but they are not valid UTF-8, so again some incorrect conversion was applied. Trying all plausible combinations of encodings until one works is probably the only option.
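
Trying the combinations can be sketched in Python 3 roughly as follows. Everything here is an assumption made for illustration: the candidate codec list, the function names, and the scoring rule, which simply measures how many letters of the result are Cyrillic. It is a way to shortlist candidates for a human to look at, not a reliable detector:

```python
from itertools import permutations

# Hypothetical candidate list; extend it for the data at hand.
CODECS = ["cp1251", "cp1252", "latin-1", "koi8-r", "cp866"]


def cyrillic_ratio(text: str) -> float:
    """Fraction of alphabetic characters that fall in the Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)


def candidate_repairs(text: str, codecs=CODECS):
    """Return (score, wrong_codec, right_codec, repaired) tuples, best first."""
    results = []
    for wrong, right in permutations(codecs, 2):
        try:
            # Assume the bytes were read with `wrong` but written with `right`.
            repaired = text.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot explain the text
        results.append((cyrillic_ratio(repaired), wrong, right, repaired))
    results.sort(key=lambda r: r[0], reverse=True)
    return results


results = candidate_repairs("Îêåàí Åëüçè - Ìàéæå âåñíà")
shortlist = [r for r in results if r[0] > 0.9]
```

For this input the shortlist contains the correct cp1252/cp1251 repair, but also wrong candidates such as cp1252/koi8-r, whose output consists of Cyrillic letters too and therefore scores just as high. Choosing between them still needs a person, a dictionary check, or a statistical detector such as chardet.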


Source: https://habr.com/ru/post/1336000/

