Java 1.6 Windows-1252 Encoding Fails on 3 Characters

Question

EDIT: I have been convinced that this question is somewhat nonsensical. Thanks to those who answered. I may post a follow-up question that is more specific.

Today I was investigating some encoding problems and wrote this unit test to isolate a basic reproduction case:

    int badCount = 0;
    for (int i = 1; i < 255; i++) {
        String str = "Hi " + new String(new char[] { (char) i });

        String toLatin1 = new String(str.getBytes("UTF-8"), "latin1");
        assertEquals(str, new String(toLatin1.getBytes("latin1"), "UTF-8"));

        String toWin1252 = new String(str.getBytes("UTF-8"), "Windows-1252");
        String fromWin1252 = new String(toWin1252.getBytes("Windows-1252"), "UTF-8");

        if (!str.equals(fromWin1252)) {
            System.out.println("Can't encode: " + i + " - " + str + " - encodes as: " + fromWin1252);
            badCount++;
        }
    }
    System.out.println("Bad count: " + badCount);

Output:

    Can't encode: 129 - Hi ? - encodes as: Hi ??
    Can't encode: 141 - Hi ? - encodes as: Hi ??
    Can't encode: 143 - Hi ? - encodes as: Hi ??
    Can't encode: 144 - Hi ? - encodes as: Hi ??
    Can't encode: 157 - Hi ? - encodes as: Hi ??
    Can't encode: 193 - Hi Á - encodes as: Hi ??
    Can't encode: 205 - Hi Í - encodes as: Hi ??
    Can't encode: 207 - Hi Ï - encodes as: Hi ??
    Can't encode: 208 - Hi ? - encodes as: Hi ??
    Can't encode: 221 - Hi ? - encodes as: Hi ??
    Bad count: 10

JDK 1.6.0_07 on Mac OS 10.6.2

My observation:

Latin1 symmetrically encodes all 254 characters. Windows-1252 does not. The three printable characters (193, 205, 207) are the same codes in Latin1 and Windows-1252, so I did not expect any problems.

Can anyone explain this behavior? Is this a JDK bug?

- james

+4

java codepages windows-1252

James Cooper Jan 27 '10 at 14:58

2 answers

What your code does ( String->byte[]->String , twice) is pretty much the opposite of transcoding, and makes no sense whatsoever (almost guaranteed to lose data). Transcoding means byte[]->String->byte[] :

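    import java.io.UnsupportedEncodingException;

    // Decode the input bytes using inputEnc, then re-encode the text using targetEnc.
    public byte[] transcode(byte[] input, String inputEnc, String targetEnc)
            throws UnsupportedEncodingException {
        return new String(input, inputEnc).getBytes(targetEnc);
    }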

And, of course, it will lose data when the input contains characters that the target encoding does not support.
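To make the data-loss point concrete, here is a minimal sketch of the same byte[]->String->byte[] idea. The class name `TranscodeDemo` and the euro-sign sample are illustrative assumptions, not part of the original answer, and it uses the `Charset` overloads (available since Java 6) only to avoid the checked exception.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

public class TranscodeDemo {
    // Decode the input bytes with inputEnc, then re-encode the resulting String with targetEnc.
    public static byte[] transcode(byte[] input, Charset inputEnc, Charset targetEnc) {
        return new String(input, inputEnc).getBytes(targetEnc);
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        Charset latin1 = Charset.forName("ISO-8859-1");
        Charset win1252 = Charset.forName("windows-1252");

        // The euro sign (U+20AC) exists in Windows-1252 (byte 0x80) but not in ISO-8859-1.
        byte[] euroUtf8 = "\u20AC".getBytes(utf8);

        // Supported by the target encoding: a single byte 0x80, shown as -128 (signed).
        System.out.println(Arrays.toString(transcode(euroUtf8, utf8, win1252)));

        // Not supported by the target encoding: replaced by '?' (byte 63), so the data is lost.
        System.out.println(Arrays.toString(transcode(euroUtf8, utf8, latin1)));
    }
}
```

The second call shows the kind of loss described above: the character has no ISO-8859-1 representation, so the encoder substitutes its replacement byte.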

+2

Michael Borgwardt Jan 27 '10 at 15:25

Joachim Sauer · Accepted Answer · 2010-01-27T15:12:59+0000

In my opinion, the test program is deeply flawed because it performs effectively useless conversions between Strings that have no semantic meaning.

If you want to check whether all byte values are valid for a given encoding, then something like this might be closer to what you need:

    import java.io.UnsupportedEncodingException;
    import java.util.Arrays;

    public static void tryEncoding(final String encoding) throws UnsupportedEncodingException {
        int badCount = 0;
        for (int i = 1; i < 255; i++) {
            // Decode a single byte, then encode the result again with the same encoding.
            byte[] bytes = new byte[] { (byte) i };
            String toString = new String(bytes, encoding);
            byte[] fromString = toString.getBytes(encoding);

            // A byte that does not survive the round trip is not valid in this encoding.
            if (!Arrays.equals(bytes, fromString)) {
                System.out.println("Can't encode: " + i + " - in: " + Arrays.toString(bytes)
                        + " / out: " + Arrays.toString(fromString) + " - result: " + toString);
                badCount++;
            }
        }
        System.out.println("Bad count: " + badCount);
    }

Please note that this test program checks input using (unsigned) byte values from 1 to 254. The code in the question uses char values (equivalent to Unicode code points in this range) from 1 to 254.

Try printing the actual byte arrays handled by the program in the question, and you will see that you are not actually checking all byte values and that some of your "bad" matches are duplicates of others.
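As a rough illustration of that suggestion, the sketch below replays the question's String->byte[]->String->byte[] chain for a single char and prints the bytes in hex. The class `RoundTripTrace` and its methods `hex` and `trace` are hypothetical names introduced here, and the result assumes a JDK whose windows-1252 decoder treats byte 0x81 as undefined, as the output further down suggests.

```java
import java.nio.charset.Charset;

public class RoundTripTrace {
    static final Charset UTF8 = Charset.forName("UTF-8");
    static final Charset WIN1252 = Charset.forName("windows-1252");

    // Format bytes as unsigned two-digit hex values separated by spaces, e.g. "C3 81".
    public static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    // Encode the char as UTF-8, misread those bytes as Windows-1252,
    // then write the misread text back out as Windows-1252.
    public static String trace(char c) {
        byte[] utf8Bytes = String.valueOf(c).getBytes(UTF8);
        String misdecoded = new String(utf8Bytes, WIN1252);
        byte[] reencoded = misdecoded.getBytes(WIN1252);
        return hex(utf8Bytes) + " -> " + hex(reencoded);
    }

    public static void main(String[] args) {
        System.out.println("129: " + trace((char) 129));
        System.out.println("193: " + trace((char) 193));
        System.out.println("233: " + trace((char) 233));
    }
}
```

Under that assumption, chars 129 and 193 encode to C2 81 and C3 81 in UTF-8, so both fail on the same trailing byte 0x81, which comes back as 3F ('?'). Char 233 encodes to C3 A9, both of which are defined in Windows-1252, so it survives. That is the sense in which the ten failures in the question are really the same five bytes seen twice.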

Running this using "Windows-1252" as the argument produces this output:

    Can't encode: 129 - in: [-127] / out: [63] - result:
    Can't encode: 141 - in: [-115] / out: [63] - result:
    Can't encode: 143 - in: [-113] / out: [63] - result:
    Can't encode: 144 - in: [-112] / out: [63] - result:
    Can't encode: 157 - in: [-99] / out: [63] - result:
    Bad count: 5

This tells us that Windows-1252 does not accept the byte values 129, 141, 143, 144, and 157 as valid values. (Note: I'm talking about unsigned values here. The output above shows -127, -115, ... because Java only knows signed bytes.)
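For readers unfamiliar with the signed/unsigned distinction, here is a small sketch of how the two views of the same byte relate. The class `UnsignedBytes` and the helper `toUnsigned` are names made up for this illustration.

```java
public class UnsignedBytes {
    // Mask with 0xFF to read a Java byte (range -128..127) as an unsigned value (0..255).
    public static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte b = (byte) 129;               // the bit pattern 0x81
        System.out.println(b);             // signed view: -127
        System.out.println(toUnsigned(b)); // unsigned view: 129
    }
}
```

So the -127 in the output and the 129 in the text refer to the same byte, 0x81.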

The Wikipedia article on Windows-1252 seems to confirm this observation by stating the following:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused
