Convert UTF-8 to ISO-8859-1 in Java

Question


I am reading an XML document (UTF-8) and end up displaying its content on a web page that uses ISO-8859-1. As expected, several characters are not displayed correctly, for example smart quotes such as “ (they are displayed as ?).

Can I convert these characters from UTF-8 to ISO-8859-1?

Here is the code snippet that I wrote for this:

    BufferedReader br = new BufferedReader(
            new InputStreamReader(urlConnection.getInputStream(), "UTF-8"));
    StringBuilder sb = new StringBuilder();
    String line = null;
    while ((line = br.readLine()) != null) {
        sb.append(line);
    }
    br.close();
    byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
    return new String(latin1);

I'm not quite sure what is going wrong, but I believe that readLine() is causing the grief (since the lines will be encoded in Java / UTF-16?). Another option I tried was to replace the latin1 line with:

 byte[] latin1 = new String(sb.toString().getBytes("UTF-8")).getBytes("ISO-8859-1");

I have read previous posts on this subject, and I am learning as I go. Thanks in advance for your help.

+11

java utf-8 character-encoding iso-8859-1

Chocula Aug 13 '09 at 19:08


4 answers

Depending on the default encoding, the following lines may cause problems:

    byte[] latin1 = sb.toString().getBytes("ISO-8859-1");
    return new String(latin1);

In Java, String and char always hold UTF-16 code units. A different encoding comes into play only when converting characters to bytes or bytes back to characters. Suppose your default encoding is UTF-8: new String(latin1) then decodes the latin1 buffer as UTF-8, some Latin-1 byte sequences form invalid UTF-8, and you get ? in their place.
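As a rough illustration of the point above, the following sketch decodes ISO-8859-1 bytes with an explicit UTF-8 charset to simulate a platform whose default encoding is UTF-8. The class and method names are made up for this example and do not come from the question.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class Latin1DecodeDemo {
    // Hypothetical helper: encode text as ISO-8859-1, then decode the
    // resulting bytes with whatever charset the caller passes in.
    static String encodeLatin1ThenDecode(String text, Charset decodeWith) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, decodeWith);
    }

    public static void main(String[] args) {
        String text = "caf\u00E9"; // "café"; U+00E9 is the single byte 0xE9 in ISO-8859-1

        // Decoding as UTF-8 stands in for new String(latin1) on a UTF-8 platform.
        String wrong = encodeLatin1ThenDecode(text, StandardCharsets.UTF_8);
        String right = encodeLatin1ThenDecode(text, StandardCharsets.ISO_8859_1);

        System.out.println(wrong.equals(text));                   // false
        System.out.println(Integer.toHexString(wrong.charAt(3))); // fffd (replacement character)
        System.out.println(right.equals(text));                   // true
    }
}
```

The lone byte 0xE9 is not a complete UTF-8 sequence, so the decoder substitutes U+FFFD, which many fonts and pages render as ? or a similar placeholder.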

+4

ZZ Coder Aug 13 '09 at 19:35

When you instantiate your String object, you need to specify which encoding to use.

So replace:

    return new String(latin1);

with:

    return new String(latin1, "ISO-8859-1");
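A small sketch of what this change does and does not fix, assuming the same encode-then-decode round trip as in the question (the class and method names are invented for illustration). Naming the charset on both sides makes the result independent of the platform default, but characters that ISO-8859-1 cannot represent are already replaced by ? when getBytes runs.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharsetDemo {
    // Hypothetical helper: encode and decode with the same explicit charset.
    static String roundTripLatin1(String text) {
        byte[] latin1 = text.getBytes(StandardCharsets.ISO_8859_1);
        return new String(latin1, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        // U+00E9 exists in ISO-8859-1, so it survives the round trip.
        System.out.println(roundTripLatin1("caf\u00E9").equals("caf\u00E9")); // true

        // U+201C and U+201D (smart quotes) do not exist in ISO-8859-1;
        // getBytes substitutes the byte for '?' and the original is lost.
        System.out.println(roundTripLatin1("\u201Cquoted\u201D")); // ?quoted?
    }
}
```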

+1

fbaligand Oct 19 '11 at 9:35 a.m.


With Java 8, McDowell's answer can be simplified like this (while maintaining the proper handling of surrogate pairs):

    import java.util.PrimitiveIterator;

    public final class HtmlEncoder {
        private HtmlEncoder() {
        }

        public static <T extends Appendable> T escapeNonLatin(CharSequence sequence, T out)
                throws java.io.IOException {
            // codePoints() yields whole code points, so surrogate pairs arrive combined
            for (PrimitiveIterator.OfInt iterator = sequence.codePoints().iterator(); iterator.hasNext(); ) {
                int codePoint = iterator.nextInt();
                if (Character.UnicodeBlock.of(codePoint) == Character.UnicodeBlock.BASIC_LATIN) {
                    out.append((char) codePoint);
                } else {
                    // emit a hexadecimal numeric character reference
                    out.append("&#x");
                    out.append(Integer.toHexString(codePoint));
                    out.append(";");
                }
            }
            return out;
        }
    }

+1

robinst May 05 '16 at 1:48


McDowell · Accepted Answer · 2009-08-13 21:53

I'm not sure whether there is a normalization routine in the standard library that will do this. I do not think that the conversion of smart quotes is handled by the standard Unicode normalizer, but do not quote me.

The smart thing to do is to drop ISO-8859-1 and start using UTF-8. That said, any Unicode code point that is normally allowed in HTML can be included in a page encoded as ISO-8859-1. You can encode such characters using escape sequences as shown below:

    public final class HtmlEncoder {
        private HtmlEncoder() {}

        public static <T extends Appendable> T escapeNonLatin(CharSequence sequence, T out)
                throws java.io.IOException {
            for (int i = 0; i < sequence.length(); i++) {
                char ch = sequence.charAt(i);
                if (Character.UnicodeBlock.of(ch) == Character.UnicodeBlock.BASIC_LATIN) {
                    out.append(ch);
                } else {
                    int codepoint = Character.codePointAt(sequence, i);
                    // handle supplementary range chars
                    i += Character.charCount(codepoint) - 1;
                    // emit entity
                    out.append("&#x");
                    out.append(Integer.toHexString(codepoint));
                    out.append(";");
                }
            }
            return out;
        }
    }

Usage example:

    String foo = "This is Cyrillic Ya: \u044F\n"
            + "This is fraktur G: \uD835\uDD0A\n"
            + "This is a smart quote: \u201C";
    StringBuilder sb = HtmlEncoder.escapeNonLatin(foo, new StringBuilder());
    System.out.println(sb.toString());

Above, the character LEFT DOUBLE QUOTATION MARK (U+201C “) is encoded as &#x201c;. A couple of other arbitrary code points are likewise encoded.

Care must be taken with this approach. If your text needs to be escaped for HTML, that must be done before the above code runs; otherwise the ampersands in the emitted entities end up being escaped as well.
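The ordering issue can be sketched as follows. The escapeHtml helper here is a deliberately minimal, hypothetical stand-in for whatever HTML escaper you actually use, and escapeNonLatin is a compact String-returning copy of the logic above, kept inline so the example is self-contained.

```java
public class EscapeOrderDemo {
    // Hypothetical minimal HTML escaper, for illustration only.
    static String escapeHtml(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    // Same idea as HtmlEncoder.escapeNonLatin, returning a String for brevity.
    static String escapeNonLatin(String s) {
        StringBuilder out = new StringBuilder();
        s.codePoints().forEach(cp -> {
            if (Character.UnicodeBlock.of(cp) == Character.UnicodeBlock.BASIC_LATIN) {
                out.append((char) cp);
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "a < b \u201C";

        // HTML escaping first, entity encoding second: entities stay intact.
        System.out.println(escapeNonLatin(escapeHtml(text))); // a &lt; b &#x201c;

        // Reversed order: the ampersand of the entity gets escaped too.
        System.out.println(escapeHtml(escapeNonLatin(text))); // a &lt; b &amp;#x201c;
    }
}
```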
