Encode byte [] as a string

Question

Heyho,

I want to convert byte data, which can be anything, to a String. My question is whether encoding the byte data using UTF-8 is "safe", for example:

String s1 = new String(data, "UTF-8");

or using base64:

 String s2 = Base64.encodeToString(data, false); //migbase64

I'm just afraid that using the first method has negative side effects. I mean, both options work p̶e̶r̶f̶e̶c̶t̶l̶y̶, but s1 can contain any character of the UTF-8 charset, while s2 uses only "readable" characters. I'm just not sure whether base64 is really necessary. Basically, I just need to create a String, send it over the network, and receive it again. (In my situation there is no other way :/)

The question is only about negative side effects, if there are any!

+6

java encoding utf-8 byte character-encoding

maxammann Nov 10 '13 at 20:23

3 answers

You can store the bytes in a String, although this is not a good idea. You cannot use UTF-8, as this will mangle the bytes; a faster and more efficient way is to use the ISO-8859-1 encoding, or plain 8-bit. The simplest way to do this is to use

 String s1 = new String(data, 0); // deprecated String(byte[] ascii, int hibyte) constructor

or

 String s1 = new String(data, "ISO-8859-1");

From UTF-8 on Wikipedia: as Jon Skeet notes, the encodings below are not valid under the standard, and their behaviour in Java varies. DataInputStream treats the first three versions as the same, and the next two throw an exception. The Charset decoder silently treats them as single characters.

 00000000 is \0
 11000000 10000000 is \0
 11100000 10000000 10000000 is \0
 11110000 10000000 10000000 10000000 is \0
 11111000 10000000 10000000 10000000 10000000 is \0
 11111100 10000000 10000000 10000000 10000000 10000000 is \0

This means that if you see \0 in your String, you cannot know for sure what the original byte[] values were. DataOutputStream uses the second option for compatibility with C, which sees \0 as a terminator.

BTW, DataOutputStream is not aware of code points, so it writes characters with high code points as UTF-16 surrogates and then UTF-8 encodes each surrogate.
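The two DataOutputStream behaviours described above can be observed with a small sketch. The class and method names here (ModifiedUtf8Demo, writeUtf) are made up for illustration; only DataOutputStream.writeUTF and String.getBytes are standard library calls. Note that writeUTF also prepends a two-byte length, which is included in the counts.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of the original answer.
class ModifiedUtf8Demo {

    // Returns the raw bytes produced by DataOutputStream.writeUTF:
    // a 2-byte length prefix followed by the "modified UTF-8" payload.
    static byte[] writeUtf(String s) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeUTF(s);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    public static void main(String[] args) {
        // "\0" is written as the two-byte form 11000000 10000000, never as a bare 0 byte.
        byte[] nul = writeUtf("\0");
        System.out.printf("payload for \\0: %02x %02x%n", nul[2] & 0xFF, nul[3] & 0xFF);

        // A character outside the BMP is one code point but two UTF-16 chars.
        String high = new String(Character.toChars(0x1F600));
        System.out.println("standard UTF-8 bytes: " + high.getBytes(StandardCharsets.UTF_8).length);
        System.out.println("writeUTF payload bytes: " + (writeUtf(high).length - 2));
    }
}
```

Standard UTF-8 encodes the supplementary character in four bytes, while writeUTF encodes each surrogate separately in three bytes, so the two forms are not interchangeable.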

0xFE and 0xFF are not valid anywhere in a character. Byte values of 11000000 (0xC0) and above can only appear at the start of a character, not inside a multi-byte character.

+4

Peter Lawrey Nov 10 '13 at 20:28

Confirmed the accepted answer with Java. To repeat: UTF-8 and UTF-16 do not preserve all byte values. ISO-8859-1 does preserve all byte values. But if the encoded bytes need to be transported beyond the JVM, use Base64.

    // Imports and class wrapper added for completeness. Assumes JUnit 4 and
    // Apache Commons Codec, whose Base64 class provides encodeBase64String/decodeBase64.
    import static org.junit.Assert.assertEquals;
    import static org.junit.Assert.assertFalse;
    import static org.junit.Assert.assertTrue;

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;
    import java.util.HashSet;
    import java.util.Set;

    import org.apache.commons.codec.binary.Base64;
    import org.junit.Test;

    public class ByteEncodingTest {

        @Test
        public void testBase64() {
            final byte[] original = enumerate();
            final String encoded = Base64.encodeBase64String( original );
            final byte[] decoded = Base64.decodeBase64( encoded );
            assertTrue( "Base64 preserves bytes", Arrays.equals( original, decoded ) );
        }

        @Test
        public void testIso8859() {
            final byte[] original = enumerate();
            String s = new String( original, StandardCharsets.ISO_8859_1 );
            final byte[] decoded = s.getBytes( StandardCharsets.ISO_8859_1 );
            assertTrue( "ISO-8859-1 preserves bytes", Arrays.equals( original, decoded ) );
        }

        @Test
        public void testUtf16() {
            final byte[] original = enumerate();
            String s = new String( original, StandardCharsets.UTF_16 );
            final byte[] decoded = s.getBytes( StandardCharsets.UTF_16 );
            assertFalse( "UTF-16 does not preserve bytes", Arrays.equals( original, decoded ) );
        }

        @Test
        public void testUtf8() {
            final byte[] original = enumerate();
            String s = new String( original, StandardCharsets.UTF_8 );
            final byte[] decoded = s.getBytes( StandardCharsets.UTF_8 );
            assertFalse( "UTF-8 does not preserve bytes", Arrays.equals( original, decoded ) );
        }

        @Test
        public void testEnumerate() {
            final Set<Byte> byteSet = new HashSet<>();
            final byte[] bytes = enumerate();
            for ( byte b : bytes ) {
                byteSet.add( b );
            }
            assertEquals( "Expecting 256 distinct values of byte.", 256, byteSet.size() );
        }

        /**
         * Enumerates all the byte values.
         */
        private byte[] enumerate() {
            final int length = Byte.MAX_VALUE - Byte.MIN_VALUE + 1;
            final byte[] bytes = new byte[length];
            for ( int i = 0; i < length; i++ ) {
                bytes[i] = (byte)(i + Byte.MIN_VALUE);
            }
            return bytes;
        }
    }

+2

neurite Nov 19 '15 at 18:15

Jon Skeet · Accepted Answer · 2013-11-10T20:26:03+0000

You should absolutely use base64 or possibly hex. (Either will work; base64 is more compact, but harder for humans to read.)

You claim that "both options work fine," but this is actually not the case. If you use the first approach and data is not actually a valid UTF-8 sequence, you will lose data. You are not trying to convert UTF-8 encoded text into a String, so don't write code that tries to do so.
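A minimal sketch of that data loss, assuming nothing beyond the standard library. The class and method names (Utf8RoundTrip, roundTrip) are invented for this illustration. Bytes that do not form valid UTF-8 are replaced during decoding, so encoding the String again does not give back the original array.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class Utf8RoundTrip {

    // Decodes the bytes as UTF-8 and immediately encodes the result again.
    static byte[] roundTrip(byte[] data) {
        String s = new String(data, StandardCharsets.UTF_8);
        return s.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // 0xFF and 0xFE can never appear in valid UTF-8; 0x41 is 'A'.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x41};
        byte[] back = roundTrip(data);

        System.out.println("original:   " + Arrays.toString(data));
        System.out.println("round trip: " + Arrays.toString(back));
        System.out.println("equal: " + Arrays.equals(data, back)); // prints "equal: false"
    }
}
```

Each invalid byte is decoded to the replacement character U+FFFD, which is three bytes in UTF-8, so the round-tripped array is both longer than and different from the input.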

Using ISO-8859-1 as the encoding will preserve all the data, but in very many cases the resulting string will not be easy to transport over other protocols. For example, it may well contain unprintable control characters.
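To make the control-character point concrete, here is a small sketch. The names Latin1ControlChars and countControlChars are hypothetical; Character.isISOControl is a standard method. The bytes survive the round trip, but most of the resulting characters are not printable.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of the original answer.
class Latin1ControlChars {

    // Counts chars in the ISO control ranges U+0000..U+001F and U+007F..U+009F.
    static int countControlChars(String s) {
        int count = 0;
        for (int i = 0; i < s.length(); i++) {
            if (Character.isISOControl(s.charAt(i))) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        byte[] data = {0x00, 0x07, 0x1B, 0x41, (byte) 0x9F};
        String s = new String(data, StandardCharsets.ISO_8859_1);

        // Every byte maps to exactly one char, so the bytes come back unchanged.
        boolean preserved = Arrays.equals(data, s.getBytes(StandardCharsets.ISO_8859_1));
        System.out.println("bytes preserved: " + preserved);
        System.out.println("control characters: " + countControlChars(s) + " of " + s.length());
    }
}
```

Whether such a string survives transport depends on the protocol in between, which is why the answer recommends base64 or hex instead.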

Use the String(byte[], String) constructor when you have inherently textual data that you happen to have in encoded form (where the encoding is specified as the second argument). For anything else - music, video, images, encrypted or compressed data, for example - you should use an approach that treats the incoming data as "arbitrary binary data" and finds a textual encoding of it... which is precisely what base64 and hex do.
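As a rough sketch of both approaches: the question uses the migbase64 library, whereas this example swaps in java.util.Base64 (available since Java 8) and a hand-written hex codec. The class and method names (BinaryToText, toBase64, fromBase64, toHex, fromHex) are made up for illustration.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical helper class, not part of the original answer.
class BinaryToText {

    static String toBase64(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    static byte[] fromBase64(String text) {
        return Base64.getDecoder().decode(text);
    }

    // Two lowercase hex digits per byte.
    static String toHex(byte[] data) {
        StringBuilder sb = new StringBuilder(data.length * 2);
        for (byte b : data) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    // Assumes an even-length string of valid hex digits.
    static byte[] fromHex(String hex) {
        byte[] out = new byte[hex.length() / 2];
        for (int i = 0; i < out.length; i++) {
            out[i] = (byte) Integer.parseInt(hex.substring(2 * i, 2 * i + 2), 16);
        }
        return out;
    }

    public static void main(String[] args) {
        byte[] data = {(byte) 0xFF, 0x00, 0x10};
        String b64 = toBase64(data);
        String hex = toHex(data);
        System.out.println("base64: " + b64);
        System.out.println("hex:    " + hex); // prints "hex:    ff0010"
        System.out.println(Arrays.equals(data, fromBase64(b64)) && Arrays.equals(data, fromHex(hex)));
    }
}
```

Both outputs contain only printable ASCII, so they can be sent as ordinary text and decoded back to the exact original bytes on the other side.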
