Will String.getBytes("UTF-16") return the same result on all platforms?

I need to create a hash from a String containing the user's password. To create the hash, I use a byte array that I get by calling String.getBytes(). But when I call this method with a specified encoding (for example, UTF-8) on a platform where it is not the default encoding, non-ASCII characters get replaced with a default character (if I understand the behavior of getBytes() correctly), and therefore on such a platform I will get a different byte array and, ultimately, a different hash.

Since strings are stored internally as UTF-16, will String.getBytes("UTF-16") guarantee that I get the same byte array on every platform, regardless of its default encoding?
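
For concreteness, a minimal sketch of what I am doing (assuming SHA-256 as the hash function; the class and method names are just for illustration):

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class PasswordHasher {
        static byte[] hash(String password) throws NoSuchAlgorithmException {
            // An explicit charset keeps the byte array independent of the
            // platform's default encoding.
            byte[] bytes = password.getBytes(StandardCharsets.UTF_16);
            return MessageDigest.getInstance("SHA-256").digest(bytes);
        }
    }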

+5
3 answers

Yes. Not only is it guaranteed to be UTF-16, but the byte order is specified too:

When decoding, the UTF-16 charset interprets the byte-order mark at the beginning of the input stream to indicate the byte order of the stream, but defaults to big-endian if there is no byte-order mark; when encoding, it uses big-endian byte order and writes a big-endian byte-order mark.

(Note that, as the quoted documentation says, the encoder writes a big-endian byte-order mark, so String.getBytes("UTF-16") does include a BOM at the start of the result; use "UTF-16BE" or "UTF-16LE" if you do not want one.)

As long as you have the same string content, that is, the same sequence of char values, you will get the same bytes in every Java implementation, barring bugs. (Any such bug would be quite surprising, given that UTF-16 is probably the easiest encoding to implement in Java...)

The fact that UTF-16 is the native representation for char (and usually for String) is only relevant in terms of ease of implementation, however. For example, I would also expect String.getBytes("UTF-8") to give the same results on every platform.
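
To see this concretely, here is a small sketch (my own, not from the answer) that prints the encoded bytes; a conformant JVM should print the same big-endian sequence everywhere, starting with the FE FF byte-order mark:

    import java.nio.charset.StandardCharsets;

    public class Utf16Demo {
        public static void main(String[] args) {
            // Expected on every platform: FE FF 00 41
            // (big-endian BOM followed by big-endian 'A')
            for (byte b : "A".getBytes(StandardCharsets.UTF_16)) {
                System.out.printf("%02X ", b);
            }
            System.out.println();
        }
    }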

+4

That is right: Java uses Unicode internally, so it can combine any script/language. String and char use UTF-16BE, but .class files store string constants in (modified) UTF-8. In general, it does not matter what String uses internally, since there is always a conversion to bytes, which specifies the encoding the bytes should be in.

If that byte encoding cannot represent some Unicode character, a placeholder or question mark is substituted. Furthermore, fonts may not contain all Unicode characters; 35 MB is a normal size for a full Unicode font. For missing code points you may then see a square with the hex code in a 2x2 grid, or on Linux another font may substitute the glyph.
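
The replacement behavior is easy to observe; a quick sketch (the sample string is mine):

    import java.nio.charset.StandardCharsets;

    public class ReplacementDemo {
        public static void main(String[] args) {
            // US-ASCII cannot represent 'é', so getBytes substitutes '?' (0x3F)
            byte[] bytes = "é".getBytes(StandardCharsets.US_ASCII);
            System.out.println(bytes[0] == '?'); // true
        }
    }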

Consequently, UTF-8 is a perfectly fine choice.

    String s = ...;
    if (!s.startsWith("\uFEFF")) { // add a Unicode BOM
        s = "\uFEFF" + s;
    }
    byte[] bytes = s.getBytes(StandardCharsets.UTF_8);

Both UTF-16 (in both byte orders) and UTF-8 are always available in the JRE, whereas some other encodings are not. Therefore you can use a constant from StandardCharsets, which does not require handling any UnsupportedEncodingException.
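
For example, compare the two overloads (a small sketch; the names are mine):

    import java.io.UnsupportedEncodingException;
    import java.nio.charset.StandardCharsets;

    public class CharsetOverloads {
        public static void main(String[] args) {
            String s = "secret";
            byte[] a;
            try {
                a = s.getBytes("UTF-8"); // String overload forces a checked exception
            } catch (UnsupportedEncodingException e) {
                throw new AssertionError("UTF-8 is guaranteed to be present", e);
            }
            byte[] b = s.getBytes(StandardCharsets.UTF_8); // Charset overload: no checked exception
            System.out.println(a.length == b.length); // true
        }
    }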

Above, I added a BOM especially so that Windows Notepad recognizes the file as UTF-8. It is certainly not nice, but it helps a bit.

Neither UTF-16LE nor UTF-16BE has a particular advantage. I think UTF-8 is more universally used, since UTF-16 also cannot store all Unicode code points in 16 bits (code points above U+FFFF need surrogate pairs). Text in Asian scripts would be more compact in UTF-16, but HTML pages are already more compact in UTF-8 because of the HTML tags and other Latin-script content.
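
A quick way to check that size claim (the sample text is mine):

    import java.nio.charset.StandardCharsets;

    public class SizeDemo {
        public static void main(String[] args) {
            String html = "<p>hello</p>";
            // Latin-script markup doubles in size in UTF-16:
            System.out.println(html.getBytes(StandardCharsets.UTF_8).length);    // 12
            System.out.println(html.getBytes(StandardCharsets.UTF_16BE).length); // 24
        }
    }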

For Windows, UTF-16LE may be more native.

Placeholder problems may occur on platforms that are not fully Unicode-based, especially Windows.

+1

I just found this:

https://github.com/facebook/conceal/issues/138

which seems to answer your question in the negative.

According to Jon Skeet, the specification is clear, but I assume that the Dalvik/JVM implementations on Android/Mac do not comply with it.

0

Source: https://habr.com/ru/post/1202708/
