Substring or characterAt method for strings with 4-byte UTF-8 characters in Java

I am trying to find a substring method or characterAt method that works with a string containing UTF-8 encoded text in JAVA.

Internally, Java works with UTF-16, which means a String consists of 2-byte chars. A UTF-8 encoded character can be up to 4 bytes long (the original UTF-8 design allowed up to 6). When Java stores a character that does not fit in one char, it splits it into two chars.

For example, the character U+20000 (UTF-8 hex: F0 A0 80 80) is stored inside Java as a String with two chars (UTF-16 hex: D840 and DC00).
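The mapping in this example can be checked directly. Below is a minimal sketch, assuming a hypothetical helper class named SurrogateDemo (not part of the original question); it relies only on the standard Character.toChars method to list the UTF-16 code units of a code point.

```java
final class SurrogateDemo {
    // Returns the UTF-16 code units of a code point as upper-case hex,
    // separated by single spaces. SurrogateDemo is a hypothetical name.
    static String utf16Units(int codePoint) {
        StringBuilder sb = new StringBuilder();
        for (char unit : Character.toChars(codePoint)) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%04X", (int) unit));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf16Units(0x20000)); // D840 DC00
        System.out.println(utf16Units(0x41));    // 0041
    }
}
```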

When you have a String containing a single 4-byte UTF-8 character and call length(), the answer is 2. When you call substring(0, 1), you get only the first half of the character.

Some code to illustrate this:

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;

// U+20000 encoded as UTF-8
ByteBuffer inputBuffer = ByteBuffer.wrap(new byte[]{(byte) 0xF0, (byte) 0xA0, (byte) 0x80, (byte) 0x80});
CharBuffer data = Charset.forName("UTF-8").decode(inputBuffer);
String string_test = data.toString();
int length = string_test.length(); // 2
String first_half = string_test.substring(0, 1); // high surrogate only
String second_half = string_test.substring(1, 2); // low surrogate only
String full_character = string_test.substring(0, 2); // whole character

None of this is wrong, even if it is unexpected, because Java works in UTF-16. It would be nice if it handled such characters as whole units, but it does not.

Does Java have any class in the library by default, or is there any class that provides UTF-8 support? Something like:

  • utf8string.length() - returns 1 if the string holds one 4-byte character
  • utf8string.getCharacterAt(0) - returns the first character, not the first half
  • utf8string.substring(0, 1) - returns the first character, not the first half

Or, what is the widely used solution for this? Replace every character that does not fit in a single UTF-16 char with a default character when reading UTF-8 files, and as a result lose all information about characters outside that range? That would not necessarily be a problem in my specific implementation, so if that is the general way to do it, I would be interested.

2 answers

Does Java have any class in the library by default, or is there any class that provides UTF-8 support?

You do not want UTF-8 support. You want to work in Unicode code points (basically 32-bit integers) instead of UTF-16 code units. And yes, Java provides support for this, but it is fairly painful to work with.

For example, to get a specific code point, use String.codePointAt, noting that the index you specify is counted in UTF-16 code units, not in code points.
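The index caveat can be seen in a short sketch. The class and method names here (CodePointAtDemo, nthCodePoint) are hypothetical; the only library calls are String.codePointAt and String.offsetByCodePoints. It assumes the caller passes a valid code point position.

```java
final class CodePointAtDemo {
    // Returns the n-th code point of s, where n is counted in code points
    // (0-based), by first converting n into a UTF-16 code unit index.
    static int nthCodePoint(String s, int n) {
        int unitIndex = s.offsetByCodePoints(0, n);
        return s.codePointAt(unitIndex);
    }

    public static void main(String[] args) {
        // 'a', U+20000, 'b': 3 code points stored in 4 chars
        String s = "a\uD840\uDC00b";
        System.out.println(Integer.toHexString(s.codePointAt(1)));   // 20000
        // Index 2 points into the middle of the pair, so only the
        // low surrogate is returned:
        System.out.println(Integer.toHexString(s.codePointAt(2)));   // dc00
        System.out.println(Integer.toHexString(nthCodePoint(s, 2))); // 62
    }
}
```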

To find the length in code points, use String.codePointCount.
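As a rough illustration, a length helper might look like the following. CodePointCountDemo and lengthInCodePoints are hypothetical names; String.codePointCount takes a begin and end index in UTF-16 code units.

```java
final class CodePointCountDemo {
    // Counts the code points in the whole string.
    static int lengthInCodePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        String s = "\uD840\uDC00"; // the single character U+20000
        System.out.println(s.length());            // 2
        System.out.println(lengthInCodePoints(s)); // 1
    }
}
```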

To take a substring, you need to find the offsets in UTF-16 code units and then use the regular substring method; use String.offsetByCodePoints to find the indexes you want.
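One way to put these two steps together is sketched below. The class CodePointSubstring and its substring method are hypothetical, not a library API; the sketch assumes 0 <= beginCp <= endCp and that the string holds at least endCp code points, otherwise the underlying calls throw IndexOutOfBoundsException.

```java
final class CodePointSubstring {
    // Returns the code points of s in the range [beginCp, endCp),
    // where both bounds are counted in code points, not chars.
    static String substring(String s, int beginCp, int endCp) {
        int begin = s.offsetByCodePoints(0, beginCp);
        // Continue from 'begin' rather than rescanning from the start.
        int end = s.offsetByCodePoints(begin, endCp - beginCp);
        return s.substring(begin, end);
    }

    public static void main(String[] args) {
        String s = "a\uD840\uDC00b"; // 'a', U+20000, 'b'
        String middle = substring(s, 1, 2);
        System.out.println(middle.length());                       // 2
        System.out.println(middle.codePointCount(0, middle.length())); // 1
    }
}
```

Here substring(s, 1, 2) returns the complete character U+20000 (both surrogates), whereas s.substring(1, 2) would return only its first half.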

Basically, look at all the methods in the String API whose names contain codePoint.


What you need to look for is Java's native support for UTF-32. Check out the String#*codePoint* methods, such as codePointAt.
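If working with a UTF-32-like view is acceptable, one option (assuming Java 8 or later, which this answer does not mention) is to convert the whole string to an int array of code points with String.codePoints and index into that. The class and method names below are hypothetical.

```java
final class CodePointsStreamDemo {
    // Converts a string to an array with one int per code point.
    static int[] toCodePoints(String s) {
        return s.codePoints().toArray();
    }

    // Builds a string back from an array of code points.
    static String fromCodePoints(int[] codePoints) {
        return new String(codePoints, 0, codePoints.length);
    }

    public static void main(String[] args) {
        int[] cps = toCodePoints("a\uD840\uDC00b"); // 'a', U+20000, 'b'
        System.out.println(cps.length);                    // 3
        System.out.println(Integer.toHexString(cps[1]));   // 20000
        System.out.println(fromCodePoints(cps).length());  // 4
    }
}
```

The trade-off is memory: the array uses 4 bytes per character, but indexing by character position becomes a plain array access.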


Source: https://habr.com/ru/post/948933/

