I am trying to read an HTML file encoded in EUC-KR from a URL. When I run the code inside the IDE, I get the expected output, but when I build a jar and run that, the data I read is displayed as question marks ("????" instead of Korean characters). I assume the encoding is being lost somewhere.
The page's meta tag declares the following:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Here is my code:
String line;
URL u = new URL("link to the site");
InputStream in = u.openConnection().getInputStream();
BufferedReader r = new BufferedReader(new InputStreamReader(in, "EUC-KR"));
while ((line = r.readLine()) != null) {
    // ... this part works fine now
}
// This is where I believe the encoding is lost:
InputStream xin = new ByteArrayInputStream(thestring.getBytes("EUC-KR"));
Reader reader = new InputStreamReader(xin);
EditorKit kit = new HTMLEditorKit();
HTMLDocument doc = (HTMLDocument) kit.createDefaultDocument();
kit.read(reader, doc, 0);
HTMLDocument.Iterator it = doc.getIterator(HTML.Tag.STRONG);
while (it.isValid()) {
    chaps.add(doc.getText(it.getStartOffset(), it.getEndOffset() - it.getStartOffset()).trim());
    it.next();
}
I would be grateful if someone could help me figure out how to read these characters without losing the encoding, so that the application behaves the same on any platform regardless of the default system encoding.
Thanks.
PS: The program reports the default system encoding as Cp1252 when launched as a jar, and as UTF-8 when launched inside the IDE.
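To narrow this down, here is a minimal sketch of the byte round trip in isolation. It assumes that the one-argument `InputStreamReader` constructor falls back to the platform default charset, which would fit the Cp1252 versus UTF-8 difference. The class name `EucKrRoundTrip` and its helper methods are hypothetical, made up for this illustration. The sample text is written with Unicode escapes so that the source file encoding does not matter.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.StringReader;
import java.nio.charset.Charset;

public class EucKrRoundTrip {
    // Charset.forName throws if the running JVM does not provide EUC-KR.
    static final Charset EUC_KR = Charset.forName("EUC-KR");

    // Reads everything from a Reader into a String.
    static String readAll(Reader reader) throws IOException {
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = reader.read(buf)) != -1) {
            sb.append(buf, 0, n);
        }
        return sb.toString();
    }

    // Encodes the text as EUC-KR bytes, then decodes those bytes with the given charset.
    static String roundTrip(String text, Charset decodeWith) throws IOException {
        byte[] bytes = text.getBytes(EUC_KR);
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), decodeWith)) {
            return readAll(reader);
        }
    }

    // Alternative: skip the byte round trip and read the String directly.
    static String viaStringReader(String text) throws IOException {
        try (Reader reader = new StringReader(text)) {
            return readAll(reader);
        }
    }

    public static void main(String[] args) throws IOException {
        String text = "<strong>\uD55C\uAD6D\uC5B4</strong>"; // three Hangul syllables
        Charset cp1252 = Charset.forName("windows-1252");

        System.out.println("default charset: " + Charset.defaultCharset());
        System.out.println("decoded as EUC-KR matches: " + roundTrip(text, EUC_KR).equals(text));
        System.out.println("decoded as windows-1252 matches: " + roundTrip(text, cp1252).equals(text));
        System.out.println("StringReader matches: " + viaStringReader(text).equals(text));
    }
}
```

If the mismatch shows up only in the windows-1252 case, then passing the charset explicitly to `InputStreamReader`, or handing a `StringReader` to `kit.read(...)`, would remove the dependency on the default encoding at this step. This sketch does not cover how the console itself encodes output, which may also contribute to the "????" display.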