Detecting incorrect character encodings using python

Question

I am new to serious programming. I was trying to write a Python program when I ran into strings like these while reading from a file:

Îêåàí Åëüçè - Ìàéæå âåñíà
Òð è è å - - - - Modern Heating

These are supposed to be Cyrillic (cp1251), so the strings are victims of a wrong encoding. (I worked this out after a long search, using the site Universal Cyrillic decoder.)

The detect function in the chardet module can also figure it out:

chardet.detect('Îêåàí Åëüçè - Ìàéæå âåñíà'.decode('utf-8').encode('windows-1252'))

which gives:
{'confidence': 0.7679697235616183, 'encoding': 'windows-1251'}

After doing the following, I get the correct string:

 string.decode('utf-8').encode('windows-1252').decode('windows-1251').encode('utf-8')

which gives:

Океан Ельзи - Майже весна
Metal Corrosion - War of the Worlds

respectively for the above lines.
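For readers on Python 3, the same round trip can be sketched as follows. This is a minimal sketch, not part of the original post: the function name fix_cp1251_mojibake is hypothetical, and it assumes the text has already been read into a str, so the initial .decode('utf-8') is no longer needed.

```python
# Python 3 sketch: str is already Unicode, so only two steps remain.
garbled = 'Îêåàí Åëüçè - Ìàéæå âåñíà'


def fix_cp1251_mojibake(text):
    """Undo 'cp1251 bytes wrongly decoded as windows-1252' (hypothetical helper)."""
    # Step 1: get the original bytes back by encoding with the codec
    # that was wrongly used to decode them.
    raw = text.encode('windows-1252')
    # Step 2: decode those bytes with the codec they were really written in.
    return raw.decode('windows-1251')


print(fix_cp1251_mojibake(garbled))  # Океан Ельзи - Майже весна
```

If a string contains characters that windows-1252 cannot encode, the first step raises UnicodeEncodeError, which is itself a hint that the string was damaged in some other way.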

My question is: is there a way to detect such strings? Here are some other lines that I did not even find a way to fix:

Isao Sasaki - ¨¬¡Æ¨¬¡ÆAI ¨ ¬ Æ (Another farewell) (¡¾ ¢ ¬ ¬¬¬¬¬ ¾ ¾ ¾ ¾
) Yoon K. Lee and Salzburg Kammerp - ³ "¸¶À½
⁂ 晉䤠圠牥 ⁥⁡ 潂 ⁹ 䬨牡慭牴湯捩删浥硩䴠楡 ⥮
Ã ÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃÃ

Thanks so much for your answers.

python character-encoding

user579943 Jan 18 '11 at 14:35

1 answer

Lennart Regebro · Answer 1 · 2011-01-18T15:37:14+0000

Well, that Cyrillic string is not in cp1251. As you seem to have noticed, it has been encoded "twice". Most likely, someone took a byte string in cp1251, assumed it was cp1252 (or Latin-1), and encoded it as UTF-8, or something like that.

No automatic check can figure that out.

 >>> print 'Îêåàí Åëüçè - Ìàéæå âåñíà'.decode('utf8').encode('latin1').decode('cp1251')

works. The later examples look like UTF-8, in that they seem to contain both single-byte and multibyte characters, but they are not valid UTF-8, so again some incorrect conversion has been applied. Trying all plausible combinations until one works is probably the only option.
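Trying the combinations could look roughly like this in Python 3. It is only a sketch: the CANDIDATES list and the name candidate_repairs are assumptions made for illustration, and the function deliberately does not pick a winner. A simple score such as "share of Cyrillic letters" can rate a wrong pair just as highly as the right one, so a human still has to look at the output.

```python
import itertools

# Assumed list of codecs worth trying; extend it for your own data.
CANDIDATES = ['utf-8', 'latin-1', 'cp1252', 'cp1251', 'koi8-r', 'cp866']


def candidate_repairs(text, codecs=CANDIDATES):
    """Yield (wrong_codec, right_codec, repaired_text) for each pair that works.

    'wrong_codec' is the codec assumed to have been used by mistake to decode
    the bytes; 'right_codec' is the one the bytes were really written in.
    """
    for wrong, right in itertools.permutations(codecs, 2):
        try:
            repaired = text.encode(wrong).decode(right)
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this pair cannot have produced the string
        if repaired != text:
            yield wrong, right, repaired


if __name__ == '__main__':
    garbled = 'Îêåàí Åëüçè - Ìàéæå âåñíà'
    # Print every surviving candidate for manual inspection.
    for wrong, right, repaired in candidate_repairs(garbled):
        print('%-8s -> %-8s %s' % (wrong, right, repaired))
```

For the string from the question, several pairs survive, and the one from the fix above (cp1252 or latin-1 followed by cp1251) is among them; the others produce text that a reader can reject at a glance.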
