How to recover text from incorrect encoding?

Question

I have some files created on Asian versions of Windows (Chinese and Japanese XP), and their file names are garbled. For example:

ÐÂ¸è+¾«Ñ¡Õä²ØºÏ¼

How can I restore the original text? I tried this in C#:

Encoding unicode = Encoding.Unicode;
Encoding chinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
// (then convert the bytes to a string)

I also tried replacing unicode with windows-1252, but no luck.

+3

encoding character-encoding

Magnetic_dud Oct 14 '09 at 6:39

3 answers

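This text is double-encoded. The original is in Windows-936; some application then assumed it was ISO-8859-1 and encoded it as UTF-8. Here is how to decode it, for example in Python: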

>>> print 'ÐÂ¸è+¾«Ñ¡Õä²ØºÏ¼'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑

You should be able to do something similar in C#.
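For reference, here is a minimal Python 3 sketch of the same round trip. The helper names garble and recover are hypothetical, and the sketch assumes the damage was exactly "CP-936 bytes read as ISO-8859-1". In Python 3 a str literal is already Unicode text, so the initial decode('utf8') step of the Python 2 one-liner is not needed.

```python
# Python 3 sketch; garble() and recover() are hypothetical helper names.

def garble(text: str) -> str:
    """Simulate the damage: encode as CP-936, then misread the bytes as ISO-8859-1."""
    return text.encode("cp936").decode("latin1")


def recover(garbled: str) -> str:
    """Undo the damage: get the original bytes back, then decode them as CP-936."""
    return garbled.encode("latin1").decode("cp936")


original = "新歌+精选珍藏合辑"
garbled = garble(original)

# The round trip is lossless because latin1 maps every byte 0x00-0xFF
# to exactly one code point and back.
print(recover(garbled) == original)  # True
```

Building the garbled string from the original avoids depending on invisible characters surviving copy and paste: the last byte of this particular name maps to a soft hyphen in ISO-8859-1, which is easily lost.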

+4

Lukáš Lalinský Oct 14 '09 at 6:50

The first argument to Encoding.Convert is the source encoding. Shouldn't that be chinese in your case?

Encoding.Convert(chinese, unicode, chineseBytes);

That might work. After all, you want to convert from CP-936 to Unicode, not the other way around.
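To illustrate why the argument order matters, here is a rough Python analogue of Encoding.Convert(srcEncoding, dstEncoding, bytes). The convert function is a hypothetical stand-in, not the .NET implementation; it uses errors="replace" on the assumption that this roughly mirrors .NET's default behavior of substituting characters it cannot convert.

```python
def convert(src_encoding: str, dst_encoding: str, data: bytes) -> bytes:
    """Hypothetical analogue of .NET Encoding.Convert: decode as src, re-encode as dst."""
    text = data.decode(src_encoding, errors="replace")
    return text.encode(dst_encoding, errors="replace")


cp936_bytes = "新歌".encode("cp936")

# Source first, destination second: CP-936 bytes become UTF-16LE bytes.
utf16_bytes = convert("cp936", "utf-16-le", cp936_bytes)
print(utf16_bytes.decode("utf-16-le") == "新歌")  # True

# With the encodings swapped, the CP-936 bytes are misread as UTF-16LE
# and the result no longer represents the original text.
swapped = convert("utf-16-le", "cp936", cp936_bytes)
print(swapped == utf16_bytes)  # False
```

Note that getting the order right only helps if the input really holds CP-936 bytes. Bytes produced by encoding an already garbled string as CP-936 do not qualify.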

0

Joey Oct 14 '09 at 6:45

bobince · Accepted Answer · 2009-10-14T08:44:24+0000

Encoding unicode = Encoding.Unicode;

This is not what you want. Unicode is Microsoft's misleading name for the UTF-16LE encoding. UTF-16LE plays no role here: you have a simple case of a code page 936 string that was misdecoded as 1252.

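Windows code page 1252 is similar to, but not the same as, ISO-8859-1. The two differ only in the bytes 0x80-0x9F; I am assuming 1252 here because it is the standard Western code page on Windows.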

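using System.Text;

Encoding latin = Encoding.GetEncoding(1252);
Encoding chinese = Encoding.GetEncoding(936);

// Re-encode the garbled string (s) to its original bytes, then decode them as 936.
string recovered = chinese.GetString(latin.GetBytes(s));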
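The difference between code page 1252 and ISO-8859-1 can be checked directly. The Python 3 sketch below compares Python's cp1252 and latin1 codecs byte by byte; it describes Python's codec tables, and other platforms (including .NET) may treat the unassigned 1252 bytes differently. The sample file name is taken from the Python answer.

```python
# Compare how each single byte decodes under cp1252 and latin1.
differing = []   # bytes that decode to different characters
undefined = []   # bytes that Python's cp1252 codec refuses to decode

for value in range(256):
    raw = bytes([value])
    try:
        if raw.decode("cp1252") != raw.decode("latin1"):
            differing.append(value)
    except UnicodeDecodeError:
        undefined.append(value)

affected = sorted(differing + undefined)
print(hex(affected[0]), hex(affected[-1]))  # 0x80 0x9f
print([hex(v) for v in undefined])          # ['0x81', '0x8d', '0x8f', '0x90', '0x9d']

# The sample file name has no CP-936 bytes in that range, so either
# code page recovers it equally well.
sample_bytes = "新歌+精选珍藏合辑".encode("cp936")
print(any(0x80 <= b <= 0x9F for b in sample_bytes))  # False
```

For names whose CP-936 bytes do fall in 0x80-0x9F, the choice matters: the recovery must use whichever code page the bytes were actually misread as.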
