Java 1.6 Windows-1252 Encoding Fails on 3 Characters

EDIT: I have been persuaded that this question is somewhat nonsensical. Thanks to those who answered. I may post a follow-up question that is more specific.

Today I was investigating some encoding problems and wrote this unit test to isolate a basic repro case:

    int badCount = 0;
    for (int i = 1; i < 255; i++) {
        String str = "Hi " + new String(new char[] { (char) i });

        String toLatin1 = new String(str.getBytes("UTF-8"), "latin1");
        assertEquals(str, new String(toLatin1.getBytes("latin1"), "UTF-8"));

        String toWin1252 = new String(str.getBytes("UTF-8"), "Windows-1252");
        String fromWin1252 = new String(toWin1252.getBytes("Windows-1252"), "UTF-8");

        if (!str.equals(fromWin1252)) {
            System.out.println("Can't encode: " + i + " - " + str + " - encodes as: " + fromWin1252);
            badCount++;
        }
    }
    System.out.println("Bad count: " + badCount);

Output:

    Can't encode: 129 - Hi?  - encodes as: Hi ??
    Can't encode: 141 - Hi?  - encodes as: Hi ??
    Can't encode: 143 - Hi?  - encodes as: Hi ??
    Can't encode: 144 - Hi?  - encodes as: Hi ??
    Can't encode: 157 - Hi?  - encodes as: Hi ??
    Can't encode: 193 - Hi Á - encodes as: Hi ??
    Can't encode: 205 - Hi Í - encodes as: Hi ??
    Can't encode: 207 - Hi Ï - encodes as: Hi ??
    Can't encode: 208 - Hi?  - encodes as: Hi ??
    Can't encode: 221 - Hi?  - encodes as: Hi ??
    Bad count: 10

JDK 1.6.0_07 on Mac OS 10.6.2

My observation:

Latin1 symmetrically encodes all 254 characters. Windows-1252 does not. The three printable characters (193, 205, 207) are the same codes in Latin1 and Windows-1252, so I did not expect any problems.

Can anyone explain this behavior? Is this a JDK bug?

- james

+4
2 answers

In my opinion, the test program is deeply flawed, because it performs effectively useless conversions between Strings that carry no semantic meaning.

If you want to check whether all byte values are valid for a given encoding, then something like this might be more like it:

    // Needs: import java.io.UnsupportedEncodingException; import java.util.Arrays;
    public static void tryEncoding(final String encoding) throws UnsupportedEncodingException {
        int badCount = 0;
        for (int i = 1; i < 255; i++) {
            // Decode a single byte, then encode it again and compare.
            byte[] bytes = new byte[] { (byte) i };
            String toString = new String(bytes, encoding);
            byte[] fromString = toString.getBytes(encoding);

            if (!Arrays.equals(bytes, fromString)) {
                System.out.println("Can't encode: " + i + " - in: " + Arrays.toString(bytes)
                        + " / out: " + Arrays.toString(fromString) + " - result: " + toString);
                badCount++;
            }
        }
        System.out.println("Bad count: " + badCount);
    }

Please note that this test program checks its input using (unsigned) byte values from 1 to 254. The code in the question uses char values (equivalent to Unicode code points in this range) from 1 to 254.

Try printing the actual byte arrays handled by the program in the question, and you will see that you are not actually checking all byte values and that some of your “bad” matches are duplicates of others.
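A minimal sketch of that suggestion follows. The class and method names are hypothetical, not from the question. It prints the UTF-8 bytes that the question's test actually passes to the Windows-1252 decoder for a few sample values, shown as unsigned hex.

```java
import java.nio.charset.Charset;

class Utf8ByteDump {
    // Returns the UTF-8 encoding of a single char value as unsigned hex, e.g. "C3 81".
    static String utf8Hex(int value) {
        byte[] bytes = String.valueOf((char) value).getBytes(Charset.forName("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // 65 is plain ASCII; 129 and 193 are two of the values reported as "bad".
        for (int value : new int[] { 65, 129, 193 }) {
            System.out.println(value + " -> " + utf8Hex(value));
        }
    }
}
```

This should print `41`, `C2 81`, and `C3 81`. Values above 127 become two bytes in UTF-8, and 129 and 193 end in the same trailing byte 0x81, so the question's loop hits that byte twice while never feeding a byte such as 0xC1 to the decoder on its own.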

Running this using "Windows-1252" as the argument produces this output:

    Can't encode: 129 - in: [-127] / out: [63] - result:  
    Can't encode: 141 - in: [-115] / out: [63] - result:  
    Can't encode: 143 - in: [-113] / out: [63] - result:  
    Can't encode: 144 - in: [-112] / out: [63] - result:  
    Can't encode: 157 - in: [-99] / out: [63] - result:  
    Bad count: 5

Which tells us that Windows-1252 does not accept the byte values 129, 141, 143, 144, and 157 as valid. (Note: I'm talking about unsigned values here. The output above shows -127, -115, ... because Java only knows signed bytes.)
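For reference, a small sketch of how the signed values in that output map back to the unsigned ones. The helper name `toUnsigned` is made up for illustration; masking with `0xFF` is the usual way to read a Java byte as a value in 0..255.

```java
class UnsignedBytes {
    // Java's byte is signed (-128..127); masking with 0xFF yields the 0..255 value.
    static int toUnsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte[] signed = { -127, -115, -113, -112, -99 };
        for (byte b : signed) {
            System.out.println(b + " -> " + toUnsigned(b) + " (0x"
                    + Integer.toHexString(toUnsigned(b)).toUpperCase() + ")");
        }
    }
}
```

The five signed values map to 129, 141, 143, 144, and 157, that is 0x81, 0x8D, 0x8F, 0x90, and 0x9D.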

The Wikipedia article on Windows-1252 seems to confirm this observation by stating the following:

According to the information on Microsoft's and the Unicode Consortium's websites, positions 81, 8D, 8F, 90, and 9D are unused
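These five unused positions also appear to account for all ten failures in the question. The sketch below uses hypothetical names and assumes only that those five byte values are the ones that cannot survive a Windows-1252 round trip. It lists every char value from 1 to 254 whose UTF-8 encoding contains one of them.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class UndefinedByteScan {
    // The five byte values quoted above as unused in Windows-1252.
    static final List<Integer> UNUSED = Arrays.asList(0x81, 0x8D, 0x8F, 0x90, 0x9D);

    // Char values in 1..254 whose UTF-8 bytes include at least one unused position.
    static List<Integer> affectedChars() {
        List<Integer> affected = new ArrayList<Integer>();
        for (int i = 1; i < 255; i++) {
            byte[] utf8 = String.valueOf((char) i).getBytes(Charset.forName("UTF-8"));
            for (byte b : utf8) {
                if (UNUSED.contains(b & 0xFF)) {
                    affected.add(i);
                    break;
                }
            }
        }
        return affected;
    }

    public static void main(String[] args) {
        System.out.println(affectedChars());
    }
}
```

The result is `[129, 141, 143, 144, 157, 193, 205, 207, 208, 221]`, the same ten values the question reports. The first five are encoded as C2 followed by an unused byte, the last five as C3 followed by the same unused bytes, which is why printable characters such as Á, Í, and Ï show up in the list.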

+4

What your code does (String->byte[]->String, twice) is pretty much the opposite of transcoding and makes no sense whatsoever (it is almost guaranteed to lose data). Transcoding means byte[]->String->byte[]:

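    // Decode the input bytes with inputEnc, then encode the resulting String with targetEnc.
    public byte[] transcode(byte[] input, String inputEnc, String targetEnc)
            throws UnsupportedEncodingException {
        return new String(input, inputEnc).getBytes(targetEnc);
    }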

And, of course, it will lose data when the input contains characters that the target encoding does not support.
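As an illustration of that data loss, here is a hedged sketch. The class name is hypothetical, and it uses `Charset` objects rather than charset names so that replacement of unmappable characters is the documented behavior of `String.getBytes(Charset)`. It transcodes the euro sign from UTF-8 to ISO-8859-1, which has no euro sign.

```java
import java.nio.charset.Charset;
import java.util.Arrays;

class LossyTranscode {
    // byte[] -> String -> byte[], as described above.
    static byte[] transcode(byte[] input, String inputEnc, String targetEnc) {
        return new String(input, Charset.forName(inputEnc)).getBytes(Charset.forName(targetEnc));
    }

    public static void main(String[] args) {
        Charset latin1 = Charset.forName("ISO-8859-1");

        // U+20AC (euro sign) takes three bytes in UTF-8: E2 82 AC.
        byte[] euro = "\u20AC".getBytes(Charset.forName("UTF-8"));

        // The target cannot represent it, so the encoder substitutes its replacement byte.
        System.out.println(Arrays.toString(transcode(euro, "UTF-8", "ISO-8859-1")));

        // canEncode lets you detect the loss up front instead of after the fact.
        System.out.println(latin1.newEncoder().canEncode('\u20AC'));
    }
}
```

The transcoded result is a single `?` byte (63), the same replacement value seen in the Windows-1252 output above, and `canEncode` returns false.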

+2

Source: https://habr.com/ru/post/1299550/

