Java reading in character streams with supplementary Unicode characters

Question

I'm having trouble reading supplementary Unicode characters in Java. I have a file that may contain characters from the supplementary set (anything above \uFFFF). When I set up my InputStreamReader to read the file as UTF-8, I would expect the read() method to return one value for each supplementary character; instead, it seems to split them at the 16-bit boundary.

I saw some other questions about basic Unicode character streams, but nothing seems to deal with the case of characters wider than 16 bits.

Here is a simplified code example:

InputStreamReader input = new InputStreamReader(file, "UTF8");
int nextChar = input.read();
while (nextChar != -1) {
    ...
    nextChar = input.read();
}

Does anyone know what I need to do to correctly read a UTF-8 encoded file that contains supplementary characters?

+2

java unicode astral-plane supplementary

wabledoodle Oct 11 '11 at 4:12

2 answers

Although read() is declared to return an int, and could in theory return the code point of a supplementary character all at once, I believe the return type is int only so that -1 can be returned at end of stream.

The value you get from read() is basically a char by another name, and a Java char is limited to 16 bits.

Java can only represent supplementary characters as a pair of UTF-16 surrogates. Once you get above 0xFFFF there is no such thing as a "single character" (at least in the sense of char), as far as Java is concerned.
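To make the surrogate-pair point concrete, here is a minimal sketch. The class name SurrogateDemo, the helper charsNeeded, and the choice of U+1D11E (MUSICAL SYMBOL G CLEF) as the sample code point are my own, purely for illustration; the Character and String methods are standard JDK calls.

```java
public class SurrogateDemo {
    // Hypothetical helper: how many 16-bit char units a code point occupies.
    static int charsNeeded(int codePoint) {
        return Character.charCount(codePoint);
    }

    public static void main(String[] args) {
        int codePoint = 0x1D11E; // sample supplementary code point (assumption)

        // Convert the code point to its UTF-16 representation.
        char[] units = Character.toChars(codePoint);

        System.out.println(units.length);                        // 2
        System.out.println(Character.isHighSurrogate(units[0])); // true
        System.out.println(Character.isLowSurrogate(units[1]));  // true

        // A String counts char units, not code points.
        String s = new String(units);
        System.out.println(s.length());                      // 2
        System.out.println(s.codePointCount(0, s.length())); // 1
    }
}
```

So a single supplementary character occupies two char slots, which is why read() hands it back in two pieces.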

+1

John Flatness Oct 11 '11 at 4:26

Chris Jester-Young · Accepted Answer · 2011-10-11T04:24:49+0000

Java works with UTF-16. So if your input stream contains astral characters, they will show up as a surrogate pair, i.e. as two chars. The first char is a high surrogate and the second is a low surrogate.
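One way to get whole code points out of a Reader is to recombine the pair yourself. The following is only a sketch, not something from the answer: the class name CodePointReading and the method readCodePoint are hypothetical, and throwing an IOException on an unpaired surrogate is a design choice I am assuming for simplicity.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

public class CodePointReading {
    // Hypothetical helper: returns the next full code point, or -1 at end of stream.
    static int readCodePoint(Reader in) throws IOException {
        int first = in.read();
        if (first == -1) {
            return -1;
        }
        char high = (char) first;
        if (Character.isHighSurrogate(high)) {
            // A high surrogate must be followed by a low surrogate.
            int second = in.read();
            if (second == -1 || !Character.isLowSurrogate((char) second)) {
                throw new IOException("Unpaired high surrogate in input");
            }
            return Character.toCodePoint(high, (char) second);
        }
        return first;
    }

    public static void main(String[] args) throws IOException {
        // Sample input (assumption): 'a', U+1D11E, 'b', encoded as UTF-8.
        byte[] utf8 = "a\uD834\uDD1Eb".getBytes(StandardCharsets.UTF_8);
        try (Reader input = new InputStreamReader(
                new ByteArrayInputStream(utf8), StandardCharsets.UTF_8)) {
            int cp = readCodePoint(input);
            while (cp != -1) {
                System.out.println(String.format("U+%04X", cp));
                cp = readCodePoint(input);
            }
        }
    }
}
```

With the sample input this prints U+0061, U+1D11E and U+0062, one per line. If the whole text fits in memory, reading it into a String and iterating with codePointAt and Character.charCount is another option.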
