Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s that contain a significant amount of Korean text. The HTML includes no character set metadata, so the Korean text no longer displays properly. The examples below all use the same piece of text.

In text editors such as Coda and TextWrangler, the text is displayed as:

╙╦ ╝№└ ▓╥╕▒ ▓╥╕▒

In the absence of character set metadata in <head>, the browser displays it as:

ÓË ¼ü¡ïÀŠ ²éÒ, ì¸æ "ì ±" ²éÒ, ì¸æ "ì ±"


Adding euc-kr metadata to <head>:

<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

produces the following, which is illegible nonsense (verified by a native speaker):

沓 숩 ∽ 핅 꿴 귥멩 レ 콛 꿴 귥멩 レ 콛


I tried this approach with all of the legacy Korean character sets, and each gave similarly unsuccessful results. I also tried parsing the files and converting them to UTF-8 with Beautiful Soup, which failed as well.

Viewing the files in Emacs seems promising, since it shows the encoded text at a lower level, as octal byte values. Here is the same sample text:

\323\313\274\374\241\357\300\212\262\351\322\215\202\354\270\346\253\354\261\224\262\351\322\215\202\354\270\346\253\354\261\224
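
To experiment with these values outside Emacs, the octal escapes can be turned back into raw bytes. The following is a minimal sketch, not part of the original question: the function name `octal_escapes_to_bytes` is made up for illustration, and only the first twelve escapes of the sample are used.

```python
import re

# First twelve octal escapes of the sample text, as Emacs displays them.
EMACS_DUMP = r"\323\313\274\374\241\357\300\212\262\351\322\215"


def octal_escapes_to_bytes(dump: str) -> bytes:
    """Convert a run of \\ooo octal escapes into the bytes they denote.

    Whitespace between the backslash and the digits is tolerated, because
    copied dumps sometimes arrive as "\\ 323 \\ 313".
    """
    codes = re.findall(r"\\\s*([0-7]{3})", dump)
    return bytes(int(code, 8) for code in codes)


raw = octal_escapes_to_bytes(EMACS_DUMP)

# Show the bytes in hex, which is how most encoding tables list them.
print(raw.hex(" "))

# Every byte in this sample has the high bit set, i.e. none is plain ASCII.
print(all(b >= 0x80 for b in raw))
```

Once the text is available as a `bytes` object, it can be fed to any decoder or to a hand-written mapping.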


How can I identify this text encoding and convert it to UTF-8?

2 answers

All of the octal codes shown by Emacs are less than 254 (376 in octal), so this looks like one of those pre-Unicode fonts that used its own mapping within the 8-bit (extended ASCII) range instead of a standard encoding. If that is correct, you need to figure out which font the text was intended for, find it, and possibly write the conversion yourself.

It is a pain. Many years ago, I did something similar for some popular Greek fonts prior to Unicode: http://litot.es/unicode-converter/ (code: https://github.com/seanredmond/Encoding-Converter )
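
If the text does turn out to rely on a font-specific mapping, "doing the conversion yourself" could look roughly like the sketch below. This is only an illustration under stated assumptions: the function name `convert_with_table` is invented here, and `DEMO_TABLE` holds placeholder entries, not the real mapping of any font. It also assumes one byte maps to one character, which may not hold for a real Korean font.

```python
def convert_with_table(raw: bytes, table: dict) -> str:
    """Map each byte to a Unicode string using a font-specific table.

    Bytes below 0x80 are treated as plain ASCII and passed through.
    High bytes missing from the table become U+FFFD so gaps stay visible.
    """
    out = []
    for b in raw:
        if b < 0x80:
            out.append(chr(b))
        else:
            out.append(table.get(b, "\ufffd"))
    return "".join(out)


# HYPOTHETICAL placeholder entries for demonstration only. The real table
# would have to be reconstructed from the original font.
DEMO_TABLE = {
    0xD3: "가",
    0xCB: "나",
}

print(convert_with_table(b"<p>\xd3\xcb\x8a</p>", DEMO_TABLE))
```

Leaving unmapped bytes as U+FFFD makes it easy to see how much of the table is still missing while it is being filled in.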


Ultimately, it comes down to finding the right character encoding and converting with iconv.

 iconv --list 

This lists all available encodings. Grepping for "KR" shows that my system, at least, supports CSEUCKR, CSISO2022KR, EUC-KR, ISO-2022-KR and ISO646-KR. According to Wikipedia, CSKSC5636 and KSC5636 are Korean as well; BIG5HKSCS is also listed, although it is properly a Hong Kong Chinese encoding. Try them all until something reasonable appears.
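
The same trial-and-error approach can be sketched in Python instead of iconv. The codec names below (`euc_kr`, `cp949`, `johab`, `iso2022_kr`) are Python's names for its built-in Korean codecs, not the iconv names above. The helper functions and the Hangul-ratio heuristic are assumptions added for illustration, and the demo input is a known-good EUC-KR string, not the problem text from the question.

```python
# Python codec names for common legacy Korean encodings.
CANDIDATES = ["euc_kr", "cp949", "johab", "iso2022_kr"]


def try_encodings(raw: bytes, candidates=CANDIDATES) -> dict:
    """Decode raw with each candidate codec, replacing undecodable bytes."""
    results = {}
    for name in candidates:
        try:
            results[name] = raw.decode(name, errors="replace")
        except (LookupError, UnicodeError):
            # Codec unavailable or unusable on this input: skip it.
            continue
    return results


def hangul_ratio(text: str) -> float:
    """Fraction of non-space characters that are Hangul syllables.

    A rough heuristic for "something reasonable": real Korean prose should
    score high, mojibake and replacement characters should score low.
    """
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hangul = sum(1 for ch in chars if "\uac00" <= ch <= "\ud7a3")
    return hangul / len(chars)


# Demo on bytes whose encoding is known, to show what a good match looks like.
sample = "한글".encode("euc_kr")
for name, text in try_encodings(sample).items():
    print(f"{name:12} {hangul_ratio(text):.2f} {text!r}")
```

A high ratio does not prove the text is correct, so the best-scoring result should still be checked by someone who reads Korean.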


Source: https://habr.com/ru/post/918291/

