Perhaps I misunderstand this, but I thought that new String(byte[], String charsetName) and String.getBytes(charset) were inverse operations?
Not necessarily.
If the input byte array contains sequences that are not valid UTF-8, the initial conversion may replace them with a substitute character (for example, a question mark). The second operation then encodes that substitute character as UTF-8, so the resulting bytes differ from the original.
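A minimal sketch of that failure, assuming Java 7 or later for StandardCharsets. The class name Utf8RoundTrip and the helper roundTrip are made up for illustration. In current Java versions the substitute character for malformed UTF-8 is typically U+FFFD rather than a literal '?', but the effect is the same: the bytes do not survive.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class Utf8RoundTrip {
    // Decode the bytes as UTF-8, then encode the resulting String back to UTF-8.
    static byte[] roundTrip(byte[] input) {
        String decoded = new String(input, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Valid UTF-8: the encoding of "h\u00e9llo".
        byte[] valid = "h\u00e9llo".getBytes(StandardCharsets.UTF_8);
        // Not valid UTF-8: 0xFF and 0xFE can never appear in a UTF-8 stream.
        byte[] invalid = {(byte) 0xFF, (byte) 0xFE, 0x41};

        System.out.println(Arrays.equals(valid, roundTrip(valid)));     // true
        System.out.println(Arrays.equals(invalid, roundTrip(invalid))); // false
    }
}
```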
It is true that some Unicode characters have multiple representations; for example, an accented character can be either a single code point or a base character code point followed by a combining accent. However, converting back and forth between a byte array (containing valid UTF-8) and a String should preserve the code point sequences. The conversion does not perform any "normalization".
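To illustrate, here is a small sketch (the class and method names are invented for this example) showing that both forms of an accented "e" come back from a UTF-8 round trip exactly as they went in. The two forms only become equal when normalization is requested explicitly via java.text.Normalizer.

```java
import java.nio.charset.StandardCharsets;
import java.text.Normalizer;

// Hypothetical demo class, not part of any library.
class NoNormalization {
    // Encode a String to UTF-8 bytes and decode it again.
    static String roundTrip(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String composed = "\u00e9";    // single code point: e with acute accent
        String decomposed = "e\u0301"; // base letter followed by a combining acute accent

        // Each form survives the round trip unchanged.
        System.out.println(roundTrip(composed).equals(composed));
        System.out.println(roundTrip(decomposed).equals(decomposed));

        // The conversion did not merge the two forms.
        System.out.println(roundTrip(decomposed).equals(composed));

        // Only an explicit normalization step makes them equal.
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        System.out.println(nfc.equals(composed));
    }
}
```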
So, what would be a safe way to transfer a byte[] array as a String?
The safest alternative would be to base64-encode the byte array. This has the added benefit that the characters in the String will survive conversion to any character set / encoding that can represent letters and digits.
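A minimal sketch of the base64 approach, assuming Java 8 or later, where java.util.Base64 is available. The class name Base64Transfer and the helpers toText and fromText are invented for this example.

```java
import java.util.Arrays;
import java.util.Base64;

// Hypothetical demo class, not part of any library.
class Base64Transfer {
    // Turn arbitrary bytes into a String made of ASCII letters, digits, '+', '/' and '='.
    static String toText(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // Recover the original bytes from the base64 text.
    static byte[] fromText(String text) {
        return Base64.getDecoder().decode(text);
    }

    public static void main(String[] args) {
        // Bytes that are not valid UTF-8 and would be damaged by a UTF-8 round trip.
        byte[] data = {(byte) 0xFF, (byte) 0xFE, 0x41};

        String text = toText(data);
        System.out.println(text);
        System.out.println(Arrays.equals(data, fromText(text))); // true
    }
}
```

The cost is size: base64 produces four characters for every three input bytes.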
Another alternative is to use Latin-1 instead of UTF-8. But:
- There is a risk of corruption if the data is (for example) mistakenly interpreted as UTF-8 somewhere along the way.
- This approach is not legal if the "string" is then embedded in XML. Many control characters are outside the XML character set and cannot be used in an XML document, even when encoded as character entities.
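The Latin-1 alternative and its XML caveat can be sketched as follows. The class and method names are hypothetical. The check in hasXmlIllegalControl is a simplification that only looks for the control characters below U+0020 that XML 1.0 disallows (everything except tab, line feed, and carriage return); it is not a full XML validity check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class, not part of any library.
class Latin1Transfer {
    // ISO-8859-1 maps every byte value 0x00..0xFF to the code point with the same number.
    static String toText(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    static byte[] fromText(String text) {
        return text.getBytes(StandardCharsets.ISO_8859_1);
    }

    // Simplified check: true if the text contains a control character
    // below U+0020 other than tab, line feed, or carriage return.
    static boolean hasXmlIllegalControl(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c < 0x20 && c != 0x09 && c != 0x0A && c != 0x0D) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Every possible byte value, 0x00 through 0xFF.
        byte[] all = new byte[256];
        for (int i = 0; i < all.length; i++) {
            all[i] = (byte) i;
        }

        String text = toText(all);
        System.out.println(Arrays.equals(all, fromText(text))); // true
        System.out.println(hasXmlIllegalControl(text));         // true
    }
}
```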