I am trying to convert a string from CP932 (aka Windows-31J) to UTF-8 in JavaScript. I am crawling a site that ignores the UTF-8 request in the request header and returns CP932-encoded text (even though the HTML meta tag declares the page as shift_jis).
Anyway, I have the whole page stored in a string variable called "html". From there I try to convert it to utf8 using this code:
    var Iconv = require('iconv').Iconv;
    var conv = new Iconv('CP932', 'UTF-8//TRANSLIT//IGNORE');
    var myBuffer = new Buffer(html.length * 3);
    myBuffer.write(html, 0, 'utf8');
    var utf8html = (conv.convert(myBuffer)).toString('utf8');
The result is not what it should be. For example, the line: "投稿者さんの稚内全日空ホテルのクチコミ (感想・情報)" comes out as "ソ ス ソ ス ソ ス electronic ソ ス メ ゑ ソ ス ソ ス ソ ス ス ス ソス ソ ス t ソ ス ソ ス ソ ス S ソ ス ソ ス ソ ス ソ ス ソ ス g ソ ス e ソ ス ソ ス ソ ス フ ク ソ ス `ソ ス R ソ ス ~ (ソ ス ソ ス ソ ス zソ ス E ソ ス ソ ス ソ ス ソ ス) "
If I remove //TRANSLIT//IGNORE (which should make it substitute similar characters for missing ones and, failing that, skip characters that cannot be transcoded), I get this error: Error: EILSEQ, Invalid character sequence.
I am open to any solution that can be implemented in nodejs, but my searches did not turn up many options besides the nodejs-iconv module.
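For illustration, here is a minimal sketch of one possible direction. It rests on assumptions that are not confirmed above: that the raw response bytes are available as a Buffer (rather than an already-decoded string), and that the Node.js build has full ICU so the built-in TextDecoder accepts the 'shift_jis' label. The helper name decodeCp932 is hypothetical. The sketch also shows why going through a UTF-8 string first may be the source of the garbage: CP932 bytes that are not valid UTF-8 get replaced with U+FFFD, after which no converter can recover them.

```javascript
// Sketch only: decode CP932 bytes with Node's built-in TextDecoder.
// Assumes a Node.js build with full ICU; otherwise 'shift_jis' may be rejected.

// Hypothetical helper; expects the raw response bytes, not a decoded string.
function decodeCp932(rawBytes) {
  // Per the WHATWG Encoding Standard, the 'shift_jis' label is intended
  // to include the Windows-31J (CP932) extensions.
  return new TextDecoder('shift_jis').decode(rawBytes);
}

// "日本語" encoded as CP932, standing in for the raw response body.
const raw = Buffer.from([0x93, 0xfa, 0x96, 0x7b, 0x8c, 0xea]);

// Interpreting those bytes as UTF-8 text is lossy: invalid sequences
// are replaced with U+FFFD, so the original bytes are gone.
const asUtf8String = raw.toString('utf8');
const lossy = asUtf8String.includes('\uFFFD');

// Decoding the untouched bytes as CP932 gives a normal JavaScript string.
const decoded = decodeCp932(raw);
console.log(lossy, decoded);
```

If the iconv module is preferred, the same idea would presumably apply: hand the untouched response Buffer to conv.convert() instead of a Buffer rebuilt from a string.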
nodejs-iconv ref: https://github.com/bnoordhuis/node-iconv
Thanks!
Edit 06.24.2011: I went ahead and implemented the solution in Java. However, I would still be interested in a JavaScript solution to this problem, if someone can solve it.
Brian