Why are there different lengths when converting an array of bytes to String, and then back to an array of bytes?

I have the following Java code:

    byte[] signatureBytes = getSignature();
    String signatureString = new String(signatureBytes, "UTF8");
    byte[] signatureStringBytes = signatureString.getBytes("UTF8");
    System.out.println(signatureBytes.length == signatureStringBytes.length); // prints false

Q: I probably misunderstand this, but I thought that new String(byte[] bytes, String charset) and String.getBytes(charset) are inverse operations?

Q: As a follow-up, what is a safe way to wrap a byte[] array as a String?

+4
3 answers

Not every byte[] is valid UTF-8. By default, invalid sequences are replaced with a replacement character, and I think that is why the length changes.

Try Latin-1; this should not happen there, since it is a simple encoding in which every byte[] is meaningful.

With Windows-1252 it will not happen either. There are undefined sequences (actually undefined bytes), but every character is encoded in a single byte. The new byte[] may differ from the original, but the lengths will be the same.
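
As a quick illustration (a minimal sketch with made-up bytes, using ISO-8859-1, i.e. Latin-1, from the standard library):

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class Latin1RoundTrip {
        public static void main(String[] args) {
            // Arbitrary bytes, including values that would be invalid as UTF-8.
            byte[] original = {(byte) 0xC3, (byte) 0x28, (byte) 0xFF, 0x41};

            // In ISO-8859-1 every byte value maps to exactly one character.
            String s = new String(original, StandardCharsets.ISO_8859_1);
            byte[] roundTripped = s.getBytes(StandardCharsets.ISO_8859_1);

            System.out.println(Arrays.equals(original, roundTripped));  // true
            System.out.println(original.length == roundTripped.length); // true
        }
    }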

+8

Perhaps I misunderstand this, but I thought that new String(byte[] bytes, String charset) and String.getBytes(charset) are inverse operations?

Not necessarily.

If the input byte array contains sequences that are not valid UTF-8, then the first conversion may turn them into (for example) question marks. The second operation then turns those into the UTF-8 encoding of '?' ... which is different from the original bytes.
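
A minimal sketch of that failure mode (the bytes here are made up; Java actually substitutes the replacement character U+FFFD, which then re-encodes as three bytes):

    import java.nio.charset.StandardCharsets;

    public class Utf8NotReversible {
        public static void main(String[] args) {
            // 0xFF and 0xFE can never appear in valid UTF-8.
            byte[] original = {(byte) 0xFF, (byte) 0xFE};

            // Each invalid byte is decoded as the replacement character U+FFFD.
            String s = new String(original, StandardCharsets.UTF_8);

            // U+FFFD re-encodes as three bytes, so the length changes.
            byte[] reEncoded = s.getBytes(StandardCharsets.UTF_8);
            System.out.println(original.length);  // 2
            System.out.println(reEncoded.length); // 6
        }
    }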


It is true that some Unicode characters have more than one representation; for example, an accented character can be a single code point, or a base character code point followed by a combining accent. However, converting back and forth between a byte array (containing valid UTF-8) and a String preserves the code point sequence; it does not perform any "normalization".
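
For example (a small sketch using both forms of 'é'), each form survives a UTF-8 round trip exactly as it was written:

    import java.nio.charset.StandardCharsets;
    import java.util.Arrays;

    public class NoNormalization {
        public static void main(String[] args) {
            String composed = "\u00E9";    // é as a single code point
            String decomposed = "e\u0301"; // e followed by a combining acute accent

            byte[] composedBytes = composed.getBytes(StandardCharsets.UTF_8);     // 2 bytes
            byte[] decomposedBytes = decomposed.getBytes(StandardCharsets.UTF_8); // 3 bytes

            // Decoding gives back exactly the form that was encoded; nothing is normalized.
            System.out.println(new String(composedBytes, StandardCharsets.UTF_8).equals(composed));     // true
            System.out.println(new String(decomposedBytes, StandardCharsets.UTF_8).equals(decomposed)); // true
            System.out.println(Arrays.equals(composedBytes, decomposedBytes));                          // false
        }
    }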


So, what would be a safe way to wrap a byte[] array as a String?

The safest alternative would be to base64-encode the byte array. This has the added benefit that the characters in the String will survive conversion to any character set / encoding that can represent letters and digits.
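
One way to do that (a sketch using the java.util.Base64 class that ships with Java 8+; the signature bytes here are placeholders):

    import java.util.Arrays;
    import java.util.Base64;

    public class Base64Wrap {
        public static void main(String[] args) {
            byte[] signatureBytes = {(byte) 0xFF, 0x00, (byte) 0x9C, 0x41}; // stand-in for getSignature()

            // Encode the raw bytes as a pure-ASCII string...
            String wrapped = Base64.getEncoder().encodeToString(signatureBytes);

            // ...and decode it back; the original bytes are recovered exactly.
            byte[] unwrapped = Base64.getDecoder().decode(wrapped);

            System.out.println(wrapped);                                  // /wCcQQ==
            System.out.println(Arrays.equals(signatureBytes, unwrapped)); // true
        }
    }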

Another alternative is to use Latin-1 instead of UTF-8. But:

  • There is a risk of corruption if the data is later (for example) mistakenly interpreted as UTF-8.
  • This approach is not viable if the "string" is then embedded in XML. Many control characters are outside the XML character set and cannot be used in an XML document, even encoded as character entities.
+5

Two possibilities come to mind.

First, your signature is probably not valid UTF-8. You cannot just take arbitrary binary data and "stringify" it; not every sequence of bits defines a legal character. The String constructor will substitute a default replacement character for binary data that does not actually mean anything in UTF-8, and that is not a reversible process. If you want a String to carry arbitrary binary data, you need to use an encoding established for that purpose; I would suggest org.apache.commons.codec.binary.Base64
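
A minimal sketch of that suggestion (assuming commons-codec is on the classpath; the bytes are placeholders for the real signature):

    import java.util.Arrays;

    import org.apache.commons.codec.binary.Base64;

    public class CommonsCodecBase64 {
        public static void main(String[] args) {
            byte[] signatureBytes = {(byte) 0xDE, (byte) 0xAD, (byte) 0xBE, (byte) 0xEF};

            // Encode the arbitrary binary data as an ASCII-safe String...
            String encoded = Base64.encodeBase64String(signatureBytes);

            // ...and decode it to get exactly the original bytes back.
            byte[] decoded = Base64.decodeBase64(encoded);

            System.out.println(encoded);                                // 3q2+7w==
            System.out.println(Arrays.equals(signatureBytes, decoded)); // true
        }
    }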

There are also some characters that have more than one representation; for example, accented characters can be encoded either as a single accented character or as a base character plus a combining accent that follows it. There is no guarantee that this is a reversible process when moving back and forth between encodings.

+2

Source: https://habr.com/ru/post/1339908/

