Java: String.toCharArray() with Unicode characters

I know that char cannot contain Unicode characters (e.g. char c = '\u1023'). So how would I do

String s = "ABCDEFG\u1023"; char[] c = s.toCharArray(); 

I would like to convert s to a char array for performance reasons, since I need to loop over every character of a potentially very long string, and doing that through the String itself seems inefficient. Anything that achieves the same result is welcome.
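
To illustrate, this is roughly the loop I have in mind (just a sketch; the println stands in for whatever I actually do with each character):

 String s = "ABCDEFG\u1023";
 char[] chars = s.toCharArray();
 for (int i = 0; i < chars.length; i++) {
     // placeholder for the real per-character work
     System.out.println((int) chars[i]);
 }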

Thanks a lot!

EDIT: Actually, char can contain Unicode characters. My mistake. Thanks to those who helped anyway.

+6
4 answers

Whoever told you that a char in Java cannot contain Unicode characters was wrong:

The values of integral types are integers in the following ranges:

  • For char, from '\u0000' to '\uffff' inclusive, that is, from 0 to 65535
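
A quick check bears this out (a minimal sketch; the printed value 4131 is just 0x1023 in decimal):

 char c = '\u1023';              // a BMP character, fits in a single char
 System.out.println((int) c);    // 4131, i.e. 0x1023
 String s = "ABCDEFG\u1023";
 char[] a = s.toCharArray();
 System.out.println(a[7] == c);  // true: the last array element is '\u1023'
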
+8

Three things:

  • A char can certainly hold '\u1023'.
  • toCharArray() will return a char array, which is essentially a sequence of UTF-16 code units.
  • Since char is 16 bits and Unicode code points need up to 21 bits, characters outside the BMP are encoded as a pair of surrogate chars. Since Java 1.5 there is an API for this, for example String.codePointAt(...). If you are using Java 1.4 or earlier, take a look at ICU4J.
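
For example, a small sketch of the code-point API with a character outside the BMP (U+1F600, written here as its surrogate pair):

 String s = "A\uD83D\uDE00B";                                // "A", U+1F600, "B"
 System.out.println(s.length());                             // 4 -- char count, surrogates counted separately
 System.out.println(s.codePointCount(0, s.length()));        // 3 -- actual characters
 System.out.println(Integer.toHexString(s.codePointAt(1)));  // 1f600 -- both surrogates read as one code point
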
+3

A Java char can hold most Unicode characters, as already mentioned, but characters outside the Basic Multilingual Plane (BMP) are split across two char values, and handling them yourself can end up corrupting the string.

To be safe, you can split the string into an array of strings:

 String[] c = s.codePoints()
               .mapToObj(cp -> new String(Character.toChars(cp)))
               .toArray(size -> new String[size]);

... or use Character.isSurrogate, Character.isLowSurrogate and Character.isHighSurrogate to make sure you never modify a single char that is part of a surrogate pair:

 Character.isSurrogate('a'); 
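
A sketch of that manual approach over a char array, keeping surrogate pairs together (U+1F680 is used as the out-of-BMP example):

 String s = "x\uD83D\uDE80y";                 // "x", U+1F680 as a surrogate pair, "y"
 char[] chars = s.toCharArray();
 for (int i = 0; i < chars.length; i++) {
     if (Character.isHighSurrogate(chars[i]) && i + 1 < chars.length
             && Character.isLowSurrogate(chars[i + 1])) {
         // treat the two chars as one character and skip the low surrogate
         System.out.println("pair:   " + new String(chars, i, 2));
         i++;
     } else {
         System.out.println("single: " + chars[i]);
     }
 }
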
+1

In Java, char is essentially an unsigned short. To iterate over a string containing Unicode characters outside the range a single char can represent (the first 65,536 code points), you should use the following pattern, which stores each code point as an int.

 for (int i = 0; i < str.length(); ) {
     int ch = str.codePointAt(i);
     // do stuff with ch...
     i += Character.charCount(ch);
 }
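
Running that pattern on a string mixing BMP and supplementary characters shows how charCount drives the step size (a minimal sketch):

 String str = "G\u1023\uD83D\uDE00";   // 'G', U+1023, and U+1F600 (a surrogate pair)
 for (int i = 0; i < str.length(); ) {
     int ch = str.codePointAt(i);
     System.out.printf("U+%04X uses %d char(s)%n", ch, Character.charCount(ch));
     i += Character.charCount(ch);     // advances by 1 or 2
 }
 // prints: U+0047 uses 1 char(s), U+1023 uses 1 char(s), U+1F600 uses 2 char(s)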

Java was designed with first-class support for the first 65,536 characters, which at the time was an improvement over C/C++, which had first-class support for only the first 128 or 256 characters. Unfortunately, this means the above pattern is needed in Java to support characters outside that range, which are becoming more common.

0

Source: https://habr.com/ru/post/898615/

