I have a collection of .html files created in the mid-90s that contain a significant amount of Korean text. The HTML contains no character set metadata, so the Korean text no longer displays properly. The examples below all use the same piece of text.
In text editors such as Coda and TextWrangler, the text is displayed as:
╙╦ ╝№└ ▓╥╕▒ ▓╥╕▒
which, in the absence of character set metadata in <head>, is displayed by the browser as:
ÓË ¼ü¡ïÀŠ ²éÒ, ì¸æ "ì ±" ²éÒ, ì¸æ "ì ±"
I then added euc-kr metadata to <head>:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
This produces the following, which is illegible nonsense (verified by a native speaker):
沓 숩 ∽ 핅 꿴 귥멩 レ 콛 꿴 귥멩 レ 콛
I tried the same approach with all the legacy Korean character sets, and each gave similarly unusable results. I also tried parsing the files and converting them to UTF-8 with Beautiful Soup, which failed as well.
Viewing the files in Emacs seems promising, as it displays the text encoding at a lower level. The following is the same sample text, shown as octal byte escapes:
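For reference, the trial-and-error above can be sketched roughly as follows in Python. This is a minimal sketch, not the exact procedure I used: the helper name `try_decodings` and the candidate list are my own choices for illustration, and the codec names are the ones Python's standard library ships for Korean. Undecodable bytes are replaced rather than raising, so every candidate yields something to inspect.

```python
# Candidate legacy Korean codecs available in the Python standard library.
CANDIDATES = ["euc_kr", "cp949", "johab", "iso2022_kr"]


def try_decodings(raw: bytes, codecs=CANDIDATES) -> dict:
    """Decode the same raw bytes with each candidate codec.

    Returns a mapping of codec name to decoded text. Bytes that a codec
    cannot decode are replaced with U+FFFD instead of raising.
    """
    return {name: raw.decode(name, errors="replace") for name in codecs}


if __name__ == "__main__":
    # A few raw bytes of the kind found in these files (illustrative sample).
    sample = b"\xd3\xcb\xbc\xfc"
    for name, text in try_decodings(sample).items():
        print(f"{name:12} {text!r}")
```

If one of the candidates were the right encoding, its line would show readable Hangul; in my case none of them did.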
\323 \313 \274 \374 \241 \357 \300 \212 \262 \351 \322 \215 \202 \354 \270 \346 \253 \354 \261 \224 \262 \351 \322 \215 \202 \354 \270 \346 \253 \354 \261 \224
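To experiment with these bytes outside Emacs, the octal escapes can be turned back into a raw byte string. The following is a minimal sketch under the assumption that every byte of interest appears as a three-digit octal escape; the function name `octal_escapes_to_bytes` is hypothetical, and any characters Emacs showed literally (such as plain ASCII) are ignored by this helper.

```python
import re


def octal_escapes_to_bytes(dump: str) -> bytes:
    """Convert Emacs-style octal escapes such as \\323\\313 into raw bytes.

    Only three-digit octal escapes are picked up; optional whitespace
    after the backslash is tolerated. Everything else is skipped.
    """
    return bytes(int(digits, 8) for digits in re.findall(r"\\\s*([0-7]{3})", dump))


if __name__ == "__main__":
    # First four escapes of the sample text.
    raw = octal_escapes_to_bytes(r"\323 \313 \274 \374")
    print(raw.hex(" "))
```

The resulting bytes can then be fed to any decoder or detection tool without going through the browser or an editor first.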
How can I identify this text encoding and convert it to UTF-8?
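The conversion step itself should be mechanical once the source encoding is known; identifying it is the part I am stuck on. As a sketch of what I would run afterwards, assuming the encoding turns out to be one Python has a codec for (the function names and the `source_encoding` parameter are placeholders of mine):

```python
from pathlib import Path


def reencode_to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode raw bytes from the given encoding and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the bytes are not valid in source_encoding,
    which is a useful signal that the guess was wrong.
    """
    return raw.decode(source_encoding).encode("utf-8")


def convert_file(src: Path, dst: Path, source_encoding: str) -> None:
    """Read src as bytes, convert to UTF-8, and write the result to dst."""
    dst.write_bytes(reencode_to_utf8(src.read_bytes(), source_encoding))
```

Strict decoding is deliberate here: a wrong guess fails loudly instead of silently writing replacement characters into the converted files.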