How to recover text from incorrect encoding?

I have some files that were created on Asian versions of Windows (Chinese and Japanese XP), and their file names are garbled. For example:

ÐÂ¸è+¾«Ñ¡Õä²ØºÏ¼­

How can I restore the original text? I tried this in C#:

Encoding unicode = Encoding.Unicode;
Encoding chinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
// (then convert the bytes to a string)

I also tried replacing unicode with Windows-1252, but no luck.

+3
3 answers
Encoding unicode = Encoding.Unicode;

That is not what you want. "Unicode" is Microsoft's misleading name for the UTF-16LE encoding. UTF-16LE plays no part here; what you have is a simple case of a code page 936 string that was wrongly decoded as code page 1252.
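
To make the failure mode concrete, here is a minimal Python 3 sketch of how such a name gets garbled. It assumes the original title is the decoded text given in another answer in this thread, and that Python's cp936 and cp1252 codecs behave like the Windows code pages for these particular bytes.

```python
# Assumed sample: the decoded title quoted in another answer in this thread.
original = "新歌+精选珍藏合辑"

# What the Chinese system stored: code page 936 bytes.
raw = original.encode("cp936")

# What a Western system shows when it assumes Windows-1252.
garbled = raw.decode("cp1252")

# UTF-16LE (Encoding.Unicode in .NET) gives a completely different
# byte sequence, so it is not involved in the garbling at all.
utf16 = original.encode("utf-16-le")

print(raw.hex())
print(garbled)
print(raw == utf16)  # False
```

The garbled form still carries every original byte, which is why the damage is reversible.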

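Windows code page 1252 is similar to, but not the same as, ISO-8859-1. There is no way to tell which of the two applies to your example string, because it contains none of the bytes 0x80-0x9F where the encodings differ, but I am assuming 1252 because it is the default code page on a Western Windows installation.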
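
As a quick check of that claim, the following Python 3 sketch lists the byte values that the two encodings decode differently, then confirms that the sample title has none of them. It relies on Python's own cp1252 and latin-1 codecs, which leave a few byte values unassigned in cp1252; the sketch counts those as differences. The helper name `differing_bytes` is made up for illustration.

```python
def differing_bytes():
    """Return the byte values that cp1252 and ISO-8859-1 decode differently."""
    diff = []
    for value in range(256):
        b = bytes([value])
        latin1 = b.decode("latin-1")  # defined for every byte value
        try:
            cp1252 = b.decode("cp1252")
        except UnicodeDecodeError:
            cp1252 = None  # unassigned in Python's cp1252 codec
        if cp1252 != latin1:
            diff.append(value)
    return diff


diff = differing_bytes()
print(f"0x{min(diff):02X}-0x{max(diff):02X}")  # 0x80-0x9F

# Code page 936 bytes of the sample title: none fall in that range.
sample = "新歌+精选珍藏合辑".encode("cp936")
print(any(0x80 <= b <= 0x9F for b in sample))  # False
```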

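// Assumption: the name was produced by decoding code page 936 bytes as Windows-1252.
Encoding latin = Encoding.GetEncoding(1252);
Encoding chinese = Encoding.GetEncoding(936);

// Recover the original bytes, then decode them as code page 936.
string recovered = chinese.GetString(latin.GetBytes(s));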
+2

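This is double-encoded text. The original was in Windows-936, then some component assumed it was ISO-8859-1 and encoded it to UTF-8. Here is how to decode it in Python:

>>> print 'ÐÂ¸è+¾«Ñ¡Õä²ØºÏ¼­'.decode('utf8').encode('latin1').decode('cp936')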
新歌+精选珍藏合辑
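
The session above uses Python 2 syntax. A rough Python 3 equivalent is sketched below. It builds its own double-encoded sample from the decoded title instead of pasting the garbled text, so that invisible characters (the sample ends in a soft hyphen) cannot get lost. The function name `undo_double_encoding` is made up for illustration.

```python
def undo_double_encoding(data: bytes) -> str:
    """Reverse the chain cp936 -> (misread as Latin-1) -> UTF-8."""
    garbled = data.decode("utf-8")    # undo the UTF-8 layer
    raw = garbled.encode("latin-1")   # recover the original bytes
    return raw.decode("cp936")        # decode them as intended


# Build a double-encoded sample from the title shown in the session above.
title = "新歌+精选珍藏合辑"
double_encoded = title.encode("cp936").decode("latin-1").encode("utf-8")

print(undo_double_encoding(double_encoded))
```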

I am sure someone else can provide the equivalent in C#.

+4

The first argument of Encoding.Convert is the source encoding. Shouldn't that be chinese in your case?

Encoding.Convert(chinese, unicode, chineseBytes);

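might actually work. After all, you want to convert from CP-936 to Unicode, not the other way around. I would also suggest not bothering with CP-1252, since your text is probably not Latin.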
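
The effect of the argument order can be sketched in Python 3. The `convert` helper below is a hypothetical stand-in that only mirrors the (source, destination, bytes) parameter order of Encoding.Convert; it is not the .NET implementation, and its handling of invalid data is an assumption.

```python
def convert(src: str, dst: str, data: bytes) -> bytes:
    """Hypothetical analogue of Encoding.Convert(src, dst, bytes)."""
    return data.decode(src, errors="replace").encode(dst, errors="replace")


title = "新歌+精选珍藏合辑"
cp936_bytes = title.encode("cp936")

# Source first, destination second: CP-936 bytes become UTF-16LE bytes.
right = convert("cp936", "utf-16-le", cp936_bytes)
print(right.decode("utf-16-le") == title)  # True

# Swapped arguments treat the CP-936 bytes as UTF-16LE and lose the text.
wrong = convert("utf-16-le", "cp936", cp936_bytes)
print(wrong.decode("cp936", errors="replace") == title)  # False
```

This only covers bytes that really are CP-936. A string that has already been decoded with the wrong code page must first be turned back into its original bytes.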

0

Source: https://habr.com/ru/post/1720045/

