Creating and Using Strings with Surrogate Pairs

I need to work with code points above 0xFFFF (in particular, with mathematical scripts) and have not found a simple guide on how to do this. I want to be able to (a) create Strings containing high code points and (b) iterate over the characters in them. Since a char cannot hold these code points, my code looks like this:

    @Test
    public void testSurrogates() throws IOException {
        // creating a string
        StringBuffer sb = new StringBuffer();
        sb.append("a");
        sb.appendCodePoint(120030);
        sb.append("b");
        String s = sb.toString();
        System.out.println("s> " + s + " " + s.length());

        // iterating over string
        int codePointCount = s.codePointCount(0, s.length());
        Assert.assertEquals(3, codePointCount);
        int charIndex = 0;
        for (int i = 0; i < codePointCount; i++) {
            int codepoint = s.codePointAt(charIndex);
            int charCount = Character.charCount(codepoint);
            System.out.println(codepoint + " " + charCount);
            charIndex += charCount;
        }
    }

I'm not convinced that this is either completely correct or the cleanest way to do this. I would have expected a method such as codePointAfter(), but there is only codePointBefore(). Please confirm that this is the right strategy or suggest an alternative.

UPDATE: Thanks for confirming, @Jon. I struggled with this; here are two mistakes to avoid:

  • assuming there is direct indexing by code point (there is no s.getCodePoint(i)); you have to iterate through the string
  • using a (char) cast, since it silently truncates integers above 0xFFFF and the error is not easy to spot
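
Both pitfalls can be illustrated with a minimal sketch. The class and method names here (SurrogatePitfalls, truncatedByCast, fromCodePoint, codePointAtCodePointIndex) are made up for illustration; only Character.toChars, String.offsetByCodePoints and String.codePointAt are standard library calls.

```java
final class SurrogatePitfalls {

    // Casting to char keeps only the low 16 bits, so a code point
    // above 0xFFFF is silently turned into a different value.
    static int truncatedByCast(int codePoint) {
        return (char) codePoint;
    }

    // Character.toChars returns a single char for code points up to 0xFFFF
    // and a surrogate pair (two chars) for anything above.
    static String fromCodePoint(int codePoint) {
        return new String(Character.toChars(codePoint));
    }

    // Stand-in for the missing s.getCodePoint(i): translate the code point
    // index into a char index first. offsetByCodePoints has to walk the
    // string, so each call costs time proportional to the index.
    static int codePointAtCodePointIndex(String s, int codePointIndex) {
        int charIndex = s.offsetByCodePoints(0, codePointIndex);
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        int cp = 120030; // U+1D4DE, the code point used in the question
        System.out.println(Integer.toHexString(truncatedByCast(cp))); // prints d4de
        String s = "a" + fromCodePoint(cp) + "b";
        System.out.println(s.length());                      // prints 4
        System.out.println(codePointAtCodePointIndex(s, 2)); // prints 98 ('b')
    }
}
```

Because each lookup by code point index rescans the string, a loop that advances the char index by Character.charCount(), as in the test above, is likely the better choice for sequential access.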
1 answer

It looks right to me. If you want to iterate over the code points in a string, you could wrap this code in an Iterable:

    // requires java.util.Iterator and java.util.NoSuchElementException
    public static Iterable<Integer> getCodePoints(final String text) {
        return new Iterable<Integer>() {
            @Override
            public Iterator<Integer> iterator() {
                return new Iterator<Integer>() {
                    // char index of the next code point to return
                    private int nextIndex = 0;

                    @Override
                    public boolean hasNext() {
                        return nextIndex < text.length();
                    }

                    @Override
                    public Integer next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        int codePoint = text.codePointAt(nextIndex);
                        // advance by one or two chars, depending on the code point
                        nextIndex += Character.charCount(codePoint);
                        return codePoint;
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

Or you could change the method to just return an int[], of course:

    public static int[] getCodePoints(String text) {
        int[] ret = new int[text.codePointCount(0, text.length())];
        int charIndex = 0;
        for (int i = 0; i < ret.length; i++) {
            ret[i] = text.codePointAt(charIndex);
            charIndex += Character.charCount(ret[i]);
        }
        return ret;
    }

I agree that it is a pity that the Java libraries do not already expose methods like these, but at least they are not difficult to write.
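
For what it's worth, on Java 8 or later the codePoints() method on CharSequence (and therefore on String) returns an IntStream of code points, which covers much of the same ground. A minimal sketch, assuming Java 8+; the class name CodePointStreamSketch and the helper fromCodePoints are made up for illustration:

```java
final class CodePointStreamSketch {

    // Produces the same array as the hand-written int[] version above,
    // using the code point stream available since Java 8.
    static int[] getCodePoints(String text) {
        return text.codePoints().toArray();
    }

    // Inverse operation: the String(int[], int, int) constructor
    // builds a String from an array of code points.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        String s = new StringBuilder()
                .append("a")
                .appendCodePoint(120030)
                .append("b")
                .toString();
        // Prints "97 1", "120030 2" and "98 1" on separate lines.
        for (int cp : getCodePoints(s)) {
            System.out.println(cp + " " + Character.charCount(cp));
        }
    }
}
```

On older Java versions the hand-written methods above remain the way to go.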


Source: https://habr.com/ru/post/1500000/

