Reading EUC HTML using Java on Windows

Question

I am trying to read an HTML file that is encoded in EUC-KR from a URL. When I run the code inside the IDE, I get the desired output, but when I build a jar and run it, the data that was read is displayed as question marks ("????" instead of Korean characters). I assume the encoding is being lost somewhere.

The site meta says the following:

 <meta http-equiv="Content-Type" content="text/html; charset=euc-kr">

Here is my code:

  String line;
  URL u = new URL("link to the site");
  InputStream in = u.openConnection().getInputStream();
  BufferedReader r = new BufferedReader(new InputStreamReader(in, "EUC-KR"));
  while ((line = r.readLine()) != null) {
    // Sending the string to a text area works fine now.
    // Next I take the string and pass it through a ByteArrayInputStream;
    // this is where I believe the encoding is lost.

    InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
    Reader reader = new InputStreamReader(xin);
    EditorKit kit = new HTMLEditorKit();
    HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
    kit.read(reader, doc, 0);
    HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);

    while (it.isValid()) {
      chaps.add(doc.getText(it.getStartOffset(), it.getEndOffset() - it.getStartOffset()).trim());
      // chaps is an ArrayList<String>
      it.next();
    }
  }

I would be grateful if someone could help me figure out how to read the characters without losing the encoding, so that the application works on any platform regardless of the default system encoding.

Thanks

PS: the program reports the default system encoding as Cp1252 when launched as a jar and as UTF-8 when launched inside the IDE.

java character-encoding bufferedreader

Monk 16 . '11 5:28

McDowell · Accepted Answer · 2011-01-16T11:20:55+0000

InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
Reader reader = new InputStreamReader(xin);

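This is a transcoding error. The string is encoded as "EUC-KR" (by getBytes) and then decoded with the default system encoding (the InputStreamReader is created without a charset), which produces junk whenever the default is not EUC-KR. To avoid this, you would have to pass the encoding to the InputStreamReader.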
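To see the mismatch in isolation, here is a minimal sketch that encodes a string with one charset and decodes the bytes with another. The class and method names (TranscodingDemo, roundTrip) are made up for illustration, and windows-1252 stands in for the Cp1252 default mentioned in the question. It assumes the runtime ships the EUC-KR charset, which standard JDK builds do.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.Charset;

// Hypothetical demo class; not part of the question's code.
final class TranscodingDemo {

    // Encodes text with one charset, then decodes the bytes with another,
    // mirroring the getBytes(...) + InputStreamReader pair from the question.
    static String roundTrip(String text, Charset encodeWith, Charset decodeWith)
            throws IOException {
        byte[] bytes = text.getBytes(encodeWith);
        try (BufferedReader r = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), decodeWith))) {
            String line = r.readLine();
            return line == null ? "" : line;
        }
    }

    public static void main(String[] args) throws IOException {
        Charset eucKr = Charset.forName("EUC-KR");
        Charset cp1252 = Charset.forName("windows-1252");

        // Korean sample text written with Unicode escapes, so the encoding
        // of this source file does not matter.
        String korean = "\uD55C\uAD6D\uC5B4";

        // Same charset on both sides: the text survives.
        System.out.println(korean.equals(roundTrip(korean, eucKr, eucKr)));  // true

        // EUC-KR bytes decoded as Cp1252: the text is corrupted.
        System.out.println(korean.equals(roundTrip(korean, eucKr, cp1252))); // false
    }
}
```

The second call is what effectively happens in the jar: the reader falls back to the platform default, so the result depends on where the program runs.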

However, it would be better to avoid the encoding and decoding altogether and just use a StringReader.
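A rough sketch of the StringReader variant, assuming the HTML has already been decoded into a String (for example by the EUC-KR BufferedReader in the question). The class name StrongTextExtractor and the method extractStrong are hypothetical. The "IgnoreCharsetDirective" property is set on the assumption that the page contains a charset meta tag, which would otherwise make the parser throw ChangedCharSetException.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.swing.text.BadLocationException;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLDocument;
import javax.swing.text.html.HTMLEditorKit;

// Hypothetical helper class; not part of the question's code.
final class StrongTextExtractor {

    // Parses already-decoded HTML and returns the text of every <strong> run.
    static List<String> extractStrong(String html)
            throws IOException, BadLocationException {
        HTMLEditorKit kit = new HTMLEditorKit();
        HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
        // The text is already a String, so a charset meta tag is irrelevant.
        doc.putProperty("IgnoreCharsetDirective", Boolean.TRUE);

        // StringReader hands the characters to the parser directly:
        // no bytes, hence no charset to get wrong.
        try (Reader reader = new StringReader(html)) {
            kit.read(reader, doc, 0);
        }

        List<String> chaps = new ArrayList<>();
        HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);
        while (it.isValid()) {
            int start = it.getStartOffset();
            chaps.add(doc.getText(start, it.getEndOffset() - start).trim());
            it.next();
        }
        return chaps;
    }

    public static void main(String[] args) throws Exception {
        String html = "<html><body><p><strong>\uD55C\uAD6D\uC5B4</strong>"
                + " and <strong>two</strong></p></body></html>";
        System.out.println(extractStrong(html).size());
    }
}
```

Because no bytes are involved after the initial read from the URL, the result no longer depends on whether the default encoding is Cp1252 or UTF-8.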