Reading character streams with supplementary Unicode characters in Java

I'm having trouble reading supplementary Unicode characters in Java. I have a file that may contain characters from the supplementary planes (code points above \uFFFF). When I set up my InputStreamReader to read the file as UTF-8, I would expect the read() method to return one value for each supplementary character; instead, it seems to split them at the 16-bit boundary.

I saw some other questions about basic Unicode character handling, but nothing seems to deal with code points wider than 16 bits.

Here is a simplified code example:

    InputStreamReader input = new InputStreamReader(file, "UTF8");
    int nextChar = input.read();
    while (nextChar != -1) {
        ...
        nextChar = input.read();
    }

Does anyone know what I need to do to correctly read a UTF-8 encoded file that contains supplementary characters?

+2
2 answers

Java works with UTF-16. So if your input stream contains astral (supplementary) characters, each one will show up as a surrogate pair, i.e. as two chars. The first char is the high surrogate, and the second char is the low surrogate.
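A minimal sketch of what that looks like in practice. The class name SurrogatePairDemo, the helper toUtf16, and the choice of U+1F600 as the sample code point are only illustrative assumptions; the Character methods are standard JDK calls.

```java
class SurrogatePairDemo {
    // Hypothetical helper: returns the UTF-16 code units Java uses for one code point.
    static char[] toUtf16(int codePoint) {
        return Character.toChars(codePoint);
    }

    public static void main(String[] args) {
        // U+1F600 is an arbitrary supplementary code point picked for illustration.
        int codePoint = 0x1F600;
        char[] units = toUtf16(codePoint);

        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // The two surrogates can be combined back into the original code point.
        System.out.println(Character.toCodePoint(units[0], units[1]) == codePoint); // true
    }
}
```

A character inside the Basic Multilingual Plane, such as 'A', comes back from the same helper as a single char.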

+4

Although read() is declared to return an int and could in theory return the whole code point of a supplementary character at once, I believe the return type is int only so that -1 can be returned.

The value you get from read() is basically a char by another name, and a Java char is limited to 16 bits.

Java can only represent supplementary characters as a pair of UTF-16 surrogates; as far as Java is concerned, there is no such thing as a "single character" (at least in the sense of char) once you get above 0xFFFF.
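If you want one value per code point, one option is to combine the two surrogates yourself after reading them. The following is only a sketch under some assumptions: the names CodePointReaderDemo, readCodePoint and decode are made up for this example, the input comes from an in-memory byte array rather than the asker's file, and an unpaired high surrogate is simply reported as an error (a UTF-8 decoder should not produce one from well-formed input).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

class CodePointReaderDemo {
    // Hypothetical helper: reads one full code point, or returns -1 at end of stream.
    static int readCodePoint(Reader in) throws IOException {
        int first = in.read();
        if (first == -1 || !Character.isHighSurrogate((char) first)) {
            return first; // end of stream, or a char that stands on its own
        }
        int second = in.read();
        if (second == -1 || !Character.isLowSurrogate((char) second)) {
            // Assumption: treat a lone high surrogate as an error instead of recovering.
            throw new IOException("unpaired high surrogate");
        }
        return Character.toCodePoint((char) first, (char) second);
    }

    // Decodes UTF-8 bytes into a list of code points using readCodePoint.
    static List<Integer> decode(byte[] utf8) {
        List<Integer> codePoints = new ArrayList<>();
        try (Reader in = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int cp;
            while ((cp = readCodePoint(in)) != -1) {
                codePoints.add(cp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return codePoints;
    }

    public static void main(String[] args) {
        // "a", then U+1F600 (arbitrary supplementary code point), then "b".
        String text = "a" + new String(Character.toChars(0x1F600)) + "b";
        List<Integer> codePoints = decode(text.getBytes(StandardCharsets.UTF_8));

        System.out.println(text.length());     // 4 chars (UTF-16 code units)
        System.out.println(codePoints.size()); // 3 code points
    }
}
```

For text that already sits in a String, the standard methods String.codePointAt and String.codePoints() give the same per-code-point view without a hand-written loop.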

+1

Source: https://habr.com/ru/post/948305/
