What does Java's offsetByCodePoints really take as an argument?

I am trying to understand some methods of the String class in Java. Here is a simple piece of code:

    /* different experiments with String class */
    public class TestStrings {
        public static void main(String[] args) {
            String greeting = "Hello\uD835\uDD6b";
            System.out.println("Number of code units in greeting is " + greeting.length());
            System.out.println("Number of code points " + greeting.codePointCount(0, greeting.length()));
            int index = greeting.offsetByCodePoints(0, 6);
            System.out.println("index = " + index);
            int cp = greeting.codePointAt(index);
            System.out.println("Code point at index is " + (char) cp);
        }
    }

\uD835\uDD6b is a ℤ-like symbol (U+1D56B, outside the Basic Multilingual Plane), so it is stored as a surrogate pair.

So the string has 6 (six) code points and 7 (seven) code units (2-byte chars). This is what the documentation says:

offsetByCodePoints

 public int offsetByCodePoints(int index, int codePointOffset) 

Returns the index within this String that is offset from the given index by codePointOffset code points. Unpaired surrogates within the text range given by index and codePointOffset count as one code point each.

Parameters:

index - the index to be offset

codePointOffset - the offset in code points

So the offset argument is given in code points. Yet with the arguments (0, 6) the call still works fine, with no exception. The crash comes from codePointAt(), because offsetByCodePoints returned 7, which is out of bounds. So maybe the method takes its arguments in code units? Or did I miss something?

2 answers

codePointAt takes a char index.

The index refers to char values (Unicode code units) and ranges from 0 to length() - 1.

This string contains six code points. The offsetByCodePoints call returns the index just past those 6 code points, which is char index 7. You then call codePointAt(7), which is one past the last char of the string, hence the exception.

To understand why, consider that

 "".offsetByCodePoints(0, 0) == 0 

because to count past all 0 code points you need to count past 0 chars.

Extrapolating to your string: to count past all 6 code points, you need to count past all 7 chars.
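The reasoning above can be checked directly. The following is a minimal sketch, not part of the original answer: the class name OffsetBoundaries and the helpers boundaries and codePointAtOrdinal are hypothetical names introduced for illustration. It lists the char index reached after skipping n code points for each n, and guards the lookup so that the "one past the end" index is never handed to codePointAt.

```java
import java.util.Arrays;

public class OffsetBoundaries {
    // For every n from 0 to the number of code points in s, returns the
    // char index reached after skipping n code points from char index 0.
    static int[] boundaries(String s) {
        int count = s.codePointCount(0, s.length());
        int[] result = new int[count + 1];
        for (int n = 0; n <= count; n++) {
            result[n] = s.offsetByCodePoints(0, n);
        }
        return result;
    }

    // Hypothetical helper: returns the n-th code point (0-based),
    // or -1 when n does not name an existing code point.
    static int codePointAtOrdinal(String s, int n) {
        if (n < 0 || n >= s.codePointCount(0, s.length())) {
            return -1;
        }
        return s.codePointAt(s.offsetByCodePoints(0, n));
    }

    public static void main(String[] args) {
        String greeting = "Hello\uD835\uDD6b";
        // prints [0, 1, 2, 3, 4, 5, 7]
        System.out.println(Arrays.toString(boundaries(greeting)));
        // prints 1d56b
        System.out.println(Integer.toHexString(codePointAtOrdinal(greeting, 5)));
        // prints -1
        System.out.println(codePointAtOrdinal(greeting, 6));
    }
}
```

The last boundary, 7, equals greeting.length(): it is a valid result of offsetByCodePoints but not a valid argument for codePointAt. The sixth code point is reached with an offset of 5, not 6.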

Perhaps seeing codePointAt in use will make this clear. This is an idiomatic way to iterate over all code points in a string (or CharSequence):

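    // All three variables are declared as int: `var` cannot declare
    // several variables at once or one without an initializer.
    for (int charIndex = 0, nChars = s.length(), codepoint;
         charIndex < nChars;
         charIndex += Character.charCount(codepoint)) {
        codepoint = s.codePointAt(charIndex);
        // Do something with codepoint.
    }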
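For a version of this loop that can be compiled and run as-is, here is a small sketch. The class name CodePointIteration and the method codePointsOf are hypothetical names chosen for illustration. It uses the static Character.codePointAt(CharSequence, int), so it accepts any CharSequence, and it prints each code point with Character.toChars rather than a (char) cast, since a cast would truncate code points above U+FFFF.

```java
import java.util.ArrayList;
import java.util.List;

public class CodePointIteration {
    // Collects every code point of s by walking char indexes and
    // advancing by 1 or 2 chars depending on the code point just read.
    static List<Integer> codePointsOf(CharSequence s) {
        List<Integer> result = new ArrayList<>();
        for (int charIndex = 0; charIndex < s.length(); ) {
            int codepoint = Character.codePointAt(s, charIndex);
            result.add(codepoint);
            charIndex += Character.charCount(codepoint);
        }
        return result;
    }

    public static void main(String[] args) {
        String greeting = "Hello\uD835\uDD6b";
        for (int codepoint : codePointsOf(greeting)) {
            // Character.toChars yields 1 or 2 chars, so supplementary
            // code points survive the conversion back to text.
            String text = new String(Character.toChars(codepoint));
            System.out.printf("U+%04X %s%n", codepoint, text);
        }
    }
}
```

On Java 8 and later, s.codePoints() returns the same sequence as an IntStream, which may be more convenient when no char index is needed.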

Useful answer, Mike. To make String#offsetByCodePoints easier to understand, I annotated its use in a slightly modified version of the question's example:

Personally, I find the Java documentation confusing on this point.

    public class TestStrings {
        public static void main(String[] args) {
            String greeting = "Hello\uD835\uDD6b";

            // Starts at `char` index 0 and moves forward by 6 code points,
            // returning the resulting `char` index.
            // ---
            // "Hello" accounts for 5 code points and 5 `char`s. The sixth
            // code point is the surrogate pair, which takes 2 `char`s, so
            // the returned value is 5 + 2 = 7, i.e. greeting.length().
            int charIndex = greeting.offsetByCodePoints(0, 6);
            System.out.println("charIndex = " + charIndex);

            // Throws IndexOutOfBoundsException: index 7 is past the last `char`.
            int cp = greeting.codePointAt(charIndex);
            System.out.println("Code point at index is " + (char) cp);
        }
    }

Source: https://habr.com/ru/post/1386525/

