I am new to serious programming. While writing a Python program, I came across lines like these when reading from a file:
Îêåàí Åëüçè - Ìàéæå âåñíà
Òð è è å - - - - Modern Heating
These should actually be Cyrillic (windows-1251, also known as cp1251), so the lines are victims of an incorrect encoding. I found this out after a long search, using the site Universal Cyrillic decoder.
The detect function in the chardet module can also find it:
chardet.detect('Îêåàí Åëüçè - Ìàéæå âåñíà'.decode('utf-8').encode('windows-1252'))
which gives:
{'confidence': 0.7679697235616183, 'encoding': 'windows-1251'}
After doing the following, I can recover the intended string:
string.decode('utf-8').encode('windows-1252').decode('windows-1251').encode('utf-8')
which gives (translated here from the Cyrillic output):
Okean Elzy - Almost Spring (Океан Ельзи - Майже весна)
Metal Corrosion - War of the Worlds
respectively for the two lines above.
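For reference, here is a minimal Python 3 sketch of the same round trip. The helper name fix_mojibake and its parameter names are my own (hypothetical), and it assumes the text was windows-1251 bytes that were wrongly decoded as windows-1252. In Python 3 a str is already Unicode, so the leading decode('utf-8') and trailing encode('utf-8') are not needed.

```python
def fix_mojibake(text, wrong="windows-1252", right="windows-1251"):
    """Undo one wrong decoding step.

    Assumption: `text` was produced by decoding `right`-encoded bytes
    with the `wrong` codec. Both codec names are guesses by the caller.
    """
    try:
        # Recover the original bytes, then decode them with the real codec.
        return text.encode(wrong).decode(right)
    except (UnicodeEncodeError, UnicodeDecodeError):
        # The text does not fit this particular mix-up; return it unchanged.
        return text


print(fix_mojibake("Îêåàí Åëüçè - Ìàéæå âåñíà"))
```

Note that this only repairs a string when the pair of encodings is already known. It does not tell me whether a given string is broken in the first place: plain ASCII text passes through unchanged, and legitimate Western European text would be converted into Cyrillic nonsense.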
My question is: is there a way to detect such strings? Here are some other lines that I could not find a way to fix at all:
Isao Sasaki - ¨¬¡Æ¨¬¡ÆAI ¨ ¬ Æ (Another farewell) (¡¾ ¢ ¬ ¬¬¬¬¬ ¾ ¾ ¾ ¾
) Yoon K. Lee and Salzburg Kammerp - ³ "¸¶À½
⁂ 晉 䤠 圠 牥 潂 ⁹ 䬨 牡 慭 牴 湯 捩 删 浥 硩 䴠 楡 ⥮
à ÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃ
Thanks so much for your answers.