String from byte[] with UTF-8 gives different results on Android than on Windows JVM

I am trying to convert an array of bytes to a string in Java with the following code:

byte[] myArray = {25, -50, -86, 81, 47, 44, 97, -5, 69, -4, 87, -114, -47, 62, -113, -64, 58, -32, -121, -102, 53, -89, -122, 12, -2, -23, -127, 111, -100, 53, -87, -23, -44, -28, 4, -21, -42, 75, 87, -112, -38, 118, 54, 92, -116, 4, -118, 110, -87, 7, -13, 3, -72, -63, -69, 123, 92, 94, 56, 61, 120, -52, 98, -17, 5, 41, 101, -3, 121, 81, -90, 12, -35, -21, -24, 112, -94, 123, 62, 8, 27, 54, 107, -77, 64, 8, -102, -99, -1, 119, 127, 43, 12, -31, -1, 51, -15, 83, -4, -68, -30, 91, -104, 84, 18, -122, -120, 66, 116, -17, -101, -24, 105, -112, -116, -64, -108, 112, -35, 61, 66, 100, 5, -24, -26, -44, 81, -84}; // Bytes from Byte.MIN_VALUE to Byte.MAX_VALUE
String result = new String(myArray, StandardCharsets.UTF_8);

The problem is that I get a different result when I run the code on Windows (JVM 1.8.0_112) than when I run it on my Android device (tested on Android 5.1 and 6.0). With a byte array of length 128, I get a string of length 120 on Android and a string of length 125 on Windows. I assume this has something to do with the fact that some of the bytes are not valid UTF-8 sequences, but it is still strange that the result depends on the platform.

If I change the encoding to US-ASCII, I get the same result on both platforms as expected:

String result = new String(myArray, StandardCharsets.US_ASCII);
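A small sketch of why US-ASCII is stable: it is a single-byte charset, so every byte yields exactly one char, and on a standard JVM every byte outside 0..127 becomes U+FFFD. There are no multi-byte sequences for a decoder to interpret strictly or leniently. The class and method names below are made up for illustration, and the sample is a short excerpt rather than the full test array.

```java
import java.nio.charset.StandardCharsets;

class AsciiDecodeDemo {
    // Hypothetical helper: decode with the single-byte US-ASCII charset.
    static String decodeAscii(byte[] bytes) {
        return new String(bytes, StandardCharsets.US_ASCII);
    }

    public static void main(String[] args) {
        // 'Q', '/', then five bytes >= 0x80, then '{'
        byte[] sample = {81, 47, -32, -121, -102, -63, -69, 123};
        String s = decodeAscii(sample);
        // One char per input byte, so the length equals sample.length.
        System.out.println("length: " + s.length());
        for (char c : s.toCharArray()) {
            System.out.printf("U+%04X ", (int) c);
        }
        System.out.println();
    }
}
```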

Edit: Sorry for the confusion. I do not generate the array at random on every run. I only meant that the bytes do not form meaningful UTF-8 text. This is the byte array I use for testing:

System.out.println(Arrays.toString(myArray)): [25, -50, -86, 81, 47, 44, 97, -5, 69, -4, 87, -114, -47, 62, -113, -64, 58, -32, -121, -102, 53, -89, -122, 12, -2, -23, -127, 111, -100, 53, -87, -23, -44, -28, 4, -21, -42, 75, 87, -112, -38, 118, 54, 92, -116, 4, -118, 110, -87, 7, -13, 3, -72, -63, -69, 123, 92, 94, 56, 61, 120, -52, 98, -17, 5, 41, 101, -3, 121, 81, -90, 12, -35, -21, -24, 112, -94, 123, 62, 8, 27, 54, 107, -77, 64, 8, -102, -99, -1, 119, 127, 43, 12, -31, -1, 51, -15, 83, -4, -68, -30, 91, -104, 84, 18, -122, -120, 66, 116, -17, -101, -24, 105, -112, -116, -64, -108, 112, -35, 61, 66, 100, 5, -24, -26, -44, 81, -84]

Edit 2: Result on Windows:

System.out.println(String(myArray, StandardCharsets.UTF_8)).length: 125
System.out.println(String(myArray, StandardCharsets.UTF_8)): ΪQ/,a E W  >  :   5    o 5      KW  v6\  n     {\^8=x b )e yQ    p {6k    w+  3 S   [ T  Bt  i    p =Bd   Q 
System.out.println(toUnicode(String(myArray, StandardCharsets.UTF_8))): \u0019\u03aa\u0051\u002f\u002c\u0061\ufffd\u0045\ufffd\u0057\ufffd\ufffd\u003e\ufffd\ufffd\u003a\ufffd\ufffd\ufffd\u0035\ufffd\ufffd\u000c\ufffd\ufffd\u006f\ufffd\u0035\ufffd\ufffd\ufffd\ufffd\u0004\ufffd\ufffd\u004b\u0057\ufffd\ufffd\u0076\u0036\u005c\ufffd\u0004\ufffd\u006e\ufffd\u0007\ufffd\u0003\ufffd\ufffd\ufffd\u007b\u005c\u005e\u0038\u003d\u0078\ufffd\u0062\ufffd\u0005\u0029\u0065\ufffd\u0079\u0051\ufffd\u000c\ufffd\ufffd\ufffd\u0070\ufffd\u007b\u003e\u0008\u001b\u0036\u006b\ufffd\u0040\u0008\ufffd\ufffd\ufffd\u0077\u007f\u002b\u000c\ufffd\ufffd\u0033\ufffd\u0053\ufffd\ufffd\ufffd\u005b\ufffd\u0054\u0012\ufffd\ufffd\u0042\u0074\ufffd\ufffd\u0069\ufffd\ufffd\ufffd\ufffd\u0070\ufffd\u003d\u0042\u0064\u0005\ufffd\ufffd\ufffd\u0051\ufffd

Android result:

System.out.println(String(myArray, StandardCharsets.UTF_8)).length: 120
System.out.println(String(myArray, StandardCharsets.UTF_8)): ΪQ/,a E W  >  :ǚ5    o 5      KW  v6\  n   {{\^8=x b )e yQ    p {>6k @   w+ 
System.out.println(toUnicode(String(myArray, StandardCharsets.UTF_8))): \u0019\u03aa\u0051\u002f\u002c\u0061\ufffd\u0045\ufffd\u0057\ufffd\ufffd\u003e\ufffd\ufffd\u003a\u01da\u0035\ufffd\ufffd\u000c\ufffd\ufffd\u006f\ufffd\u0035\ufffd\ufffd\ufffd\ufffd\u0004\ufffd\ufffd\u004b\u0057\ufffd\ufffd\u0076\u0036\u005c\ufffd\u0004\ufffd\u006e\ufffd\u0007\ufffd\u0003\ufffd\u007b\u007b\u005c\u005e\u0038\u003d\u0078\ufffd\u0062\ufffd\u0005\u0029\u0065\ufffd\u0079\u0051\ufffd\u000c\ufffd\ufffd\ufffd\u0070\ufffd\u007b\u003e\u0008\u001b\u0036\u006b\ufffd\u0040\u0008\ufffd\ufffd\ufffd\u0077\u007f\u002b\u000c\ufffd\ufffd\u0033\ufffd\u0053\ufffd\ufffd\u005b\ufffd\u0054\u0012\ufffd\ufffd\u0042\u0074\ufffd\ufffd\u0069\ufffd\ufffd\u0014\u0070\ufffd\u003d\u0042\u0064\u0005\ufffd\ufffd\ufffd\u0051\ufffd

Edit 3: Added correct UTF-16 lines

Edit 4: Modified code for working example

+4
2 answers

Android seems to be a bit sloppy when interpreting UTF-8 sequences. The relevant part of the Unicode Standard is D92 in Chapter 3, "Conformance":

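Before the Unicode Standard, Version 3.1, the problematic "non-shortest form" byte sequences in UTF-8 were those where BMP characters could be represented in more than one way. These sequences are ill-formed, because they are not allowed by Table 3-7.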

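Your data contains such "non-shortest form" sequences, namely -32, -121, -102 and -63, -69. Android decodes each of them to a character, whereas Java treats them as invalid and emits U+FFFD for every byte.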
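The arithmetic behind those two sequences can be sketched as follows. This is an illustration only, not the code of any real decoder: `lenientDecode` and `shortestLength` are hypothetical helpers that extract the payload bits of a 2- or 3-byte sequence without checking that the shortest form was used.

```java
class OverlongDecodeDemo {
    // Hypothetical helper: pure bit extraction, no shortest-form check.
    static int lenientDecode(byte[] b) {
        if (b.length == 2) {
            // 110xxxxx 10yyyyyy -> xxxxxyyyyyy
            return ((b[0] & 0x1F) << 6) | (b[1] & 0x3F);
        }
        if (b.length == 3) {
            // 1110xxxx 10yyyyyy 10zzzzzz -> xxxxyyyyyyzzzzzz
            return ((b[0] & 0x0F) << 12) | ((b[1] & 0x3F) << 6) | (b[2] & 0x3F);
        }
        throw new IllegalArgumentException("only 2- and 3-byte sequences are handled");
    }

    // Number of bytes the shortest UTF-8 form needs for a BMP code point.
    static int shortestLength(int codePoint) {
        if (codePoint < 0x80) return 1;
        if (codePoint < 0x800) return 2;
        return 3;
    }

    public static void main(String[] args) {
        byte[][] samples = { { -32, -121, -102 }, { -63, -69 } };
        for (byte[] s : samples) {
            int cp = lenientDecode(s);
            System.out.printf("U+%04X uses %d bytes, shortest form needs %d%n",
                    cp, s.length, shortestLength(cp));
        }
        // prints:
        // U+01DA uses 3 bytes, shortest form needs 2
        // U+007B uses 2 bytes, shortest form needs 1
    }
}
```

Both sequences spend more bytes than the code point requires, which is exactly what makes them non-shortest forms.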

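You can simulate this lenient behavior in Java by using a decoder for "modified UTF-8", here DataInputStream.readUTF: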

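import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class Utf8Demo {
    public static void main(String[] args) throws IOException {
        byte[][] samples = {
            { -32, -121, -102 },
            { -63, -69 }
        };
        for (byte[] array : samples) {
            System.out.println("source: " + Arrays.toString(array));
            // Standard UTF-8 decoder: rejects non-shortest forms
            String string = new String(array, StandardCharsets.UTF_8);
            System.out.println("strictly interpreted: " + string);
            System.out.println("length: " + string.length());
            // readUTF expects a 2-byte length prefix followed by the data
            ByteBuffer bb = ByteBuffer.allocate(array.length + 2);
            bb.putShort((short) array.length).put(array);
            ByteArrayInputStream bis = new ByteArrayInputStream(bb.array());
            DataInputStream dis = new DataInputStream(bis);
            // Modified UTF-8 decoder: does not check for the shortest form
            string = dis.readUTF();
            System.out.println("sloppily interpreted: " + string);
            System.out.println("length: " + string.length());
            // Re-encode to see the shortest-form bytes for the same character
            byte[] actual = string.getBytes(StandardCharsets.UTF_8);
            System.out.println("correct sequence: " + Arrays.toString(actual));
            System.out.println();
        }
    }
}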

source: [-32, -121, -102]
strictly interpreted:    
length: 3
sloppily interpreted: ǚ
length: 1
correct sequence: [-57, -102]

source: [-63, -69]
strictly interpreted:   
length: 2
sloppily interpreted: {
length: 1
correct sequence: [123]

The "correct sequence" line shows the shortest-form encoding of the same character.
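If the goal is to notice such input rather than receive replacement characters, one option is a CharsetDecoder configured with CodingErrorAction.REPORT. The sketch below uses only standard java.nio.charset calls; `decodeStrict` is a name made up for this example. It assumes a decoder that enforces the shortest form, as the desktop JVM does. On a runtime whose UTF-8 decoder is lenient, such as the Android versions in the question, the overlong samples may be accepted without an error.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

class StrictUtf8Demo {
    // Hypothetical helper: returns the decoded text, or null if the
    // decoder reports malformed or unmappable input.
    static String decodeStrict(byte[] bytes) {
        try {
            return StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes))
                    .toString();
        } catch (CharacterCodingException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        byte[][] samples = {
            { -32, -121, -102 }, // overlong form of U+01DA
            { -63, -69 },        // overlong form of U+007B
            { -57, -102 }        // shortest form of U+01DA
        };
        for (byte[] s : samples) {
            String decoded = decodeStrict(s);
            System.out.println(decoded == null ? "malformed" : "ok: " + decoded);
        }
    }
}
```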

+4

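The first difference is at the byte sequence 0xE0 0x87 0x9A. This is an overlong (non-shortest-form) encoding, which is not valid UTF-8.

The Windows JVM rejects it, whereas Android decodes it as U+01DA.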
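For a check that does not depend on the platform's decoder at all, the byte ranges of Table 3-7 of the Unicode Standard can be tested by hand. The following is a minimal sketch under that assumption. `firstIllFormedIndex` is a hypothetical helper, not a library API. It returns the index of the first byte that does not start a well-formed sequence, or -1 if the whole array is well-formed.

```java
class Utf8WellFormed {
    static int firstIllFormedIndex(byte[] b) {
        int i = 0;
        while (i < b.length) {
            int b0 = b[i] & 0xFF;
            if (b0 <= 0x7F) { i++; continue; }   // single-byte (ASCII)
            int len;
            int lo = 0x80, hi = 0xBF;            // allowed range of the 2nd byte
            if (b0 >= 0xC2 && b0 <= 0xDF) len = 2;
            else if (b0 == 0xE0) { len = 3; lo = 0xA0; } // excludes overlong forms
            else if (b0 >= 0xE1 && b0 <= 0xEC) len = 3;
            else if (b0 == 0xED) { len = 3; hi = 0x9F; } // excludes surrogates
            else if (b0 == 0xEE || b0 == 0xEF) len = 3;
            else if (b0 == 0xF0) { len = 4; lo = 0x90; } // excludes overlong forms
            else if (b0 >= 0xF1 && b0 <= 0xF3) len = 4;
            else if (b0 == 0xF4) { len = 4; hi = 0x8F; } // excludes > U+10FFFF
            else return i; // 0x80..0xC1 and 0xF5..0xFF never start a sequence
            if (i + len > b.length) return i;    // truncated sequence
            for (int k = 1; k < len; k++) {
                int bk = b[i + k] & 0xFF;
                int min = (k == 1) ? lo : 0x80;
                int max = (k == 1) ? hi : 0xBF;
                if (bk < min || bk > max) return i;
            }
            i += len;
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstIllFormedIndex(new byte[]{ -32, -121, -102 })); // 0
        System.out.println(firstIllFormedIndex(new byte[]{ -63, -69 }));        // 0
        System.out.println(firstIllFormedIndex(new byte[]{ -57, -102 }));       // -1
    }
}
```

The sequence 0xE0 0x87 0x9A fails because the second byte after 0xE0 must lie in 0xA0..0xBF. Running the same check on both platforms before decoding would at least show where their results can diverge.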

+2

Source: https://habr.com/ru/post/1673142/

