UTF-8 substring for Latin, Chinese, Cyrillic, etc.

On Windows Phone, I want to trim any given string to the equivalent of 100 ASCII characters in length.

String.Length is clearly useless here, since it counts UTF-16 code units rather than bytes: in UTF-8, a Chinese string uses 3 bytes per character, a Danish string uses 1 or 2 bytes per character, and a Russian string uses 2 bytes per character.

The only encodings available are UTF-8 and UTF-16. So what should I do?

The idea is this:

    private static string UnicodeSubstring(string text, int length)
    {
        var bytes = Encoding.UTF8.GetBytes(text);
        return Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
    }

But the cut has to fall on a character boundary rather than in the middle of a multi-byte sequence, so that the last character is always displayed correctly.

3 answers

One option is to simply go through the string, calculating the number of bytes for each character.

If you know that you do not need to deal with characters outside the BMP, this is simple enough:

    public string SubstringWithinUtf8Limit(string text, int byteLimit)
    {
        int byteCount = 0;
        char[] buffer = new char[1];
        for (int i = 0; i < text.Length; i++)
        {
            buffer[0] = text[i];
            byteCount += Encoding.UTF8.GetByteCount(buffer);
            if (byteCount > byteLimit)
            {
                // This character does not fit; return everything before it.
                return text.Substring(0, i);
            }
        }
        return text;
    }

This gets a little trickier if you have to handle surrogate pairs :(
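A minimal sketch of how the same loop might be extended to surrogate pairs, assuming char.IsHighSurrogate and char.IsLowSurrogate are available on the target platform. The class name Utf8Truncation is hypothetical; the idea is only to measure a well-formed pair as one unit so that the cut never lands between its two halves.

```csharp
using System;
using System.Text;

public static class Utf8Truncation // hypothetical helper class, not part of any library
{
    // Returns the longest prefix of text whose UTF-8 encoding fits in byteLimit bytes,
    // without splitting a surrogate pair.
    public static string SubstringWithinUtf8Limit(string text, int byteLimit)
    {
        int byteCount = 0;
        int i = 0;
        while (i < text.Length)
        {
            // A well-formed surrogate pair takes two chars and encodes to 4 UTF-8 bytes.
            bool isPair = char.IsHighSurrogate(text[i])
                          && i + 1 < text.Length
                          && char.IsLowSurrogate(text[i + 1]);
            int charsInUnit = isPair ? 2 : 1;

            byteCount += Encoding.UTF8.GetByteCount(text.Substring(i, charsInUnit));
            if (byteCount > byteLimit)
            {
                // This unit does not fit; return everything before it.
                return text.Substring(0, i);
            }
            i += charsInUnit;
        }
        return text;
    }
}
```

Combining marks are still counted separately here, so an accent can be cut off from its base letter; only surrogate pairs are kept intact.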


One option is to simply append the "characters" (keeping surrogate pairs together if you need to support them) to the result string one at a time and check whether it still encodes to no more than the number of bytes you want.
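A rough sketch of that append-and-check idea, assuming System.Globalization.StringInfo is available on the target platform (this has not been verified for Windows Phone). The class and method names are hypothetical. StringInfo.GetTextElementEnumerator yields surrogate pairs, and base characters with their combining marks, as single elements.

```csharp
using System;
using System.Globalization;
using System.Text;

public static class Utf8AppendAndCheck // hypothetical helper class
{
    // Appends text elements one at a time and stops as soon as the
    // UTF-8 encoding of the result would exceed byteLimit bytes.
    public static string TruncateByAppending(string text, int byteLimit)
    {
        var result = new StringBuilder();
        TextElementEnumerator elements = StringInfo.GetTextElementEnumerator(text);
        while (elements.MoveNext())
        {
            string element = elements.GetTextElement();
            result.Append(element);
            if (Encoding.UTF8.GetByteCount(result.ToString()) > byteLimit)
            {
                // The last element pushed the result over the limit; take it back off.
                result.Length -= element.Length;
                break;
            }
        }
        return result.ToString();
    }
}
```

Re-encoding the whole result on every step is quadratic in the worst case, which is unlikely to matter for a limit of around 100 bytes.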


Another idea is to check whether the last character is the Unicode replacement character (U+FFFD) and to keep removing the last character until the string displays correctly.

    private static string UnicodeSubstring(string text, int length)
    {
        var bytes = Encoding.UTF8.GetBytes(text);
        var result = Encoding.UTF8.GetString(bytes, 0, Math.Min(bytes.Length, length));
        // A cut inside a multi-byte sequence decodes to U+FFFD; strip it.
        // The length check avoids indexing into an empty string.
        while (result.Length > 0 && '\uFFFD' == result[result.Length - 1])
        {
            result = result.Substring(0, result.Length - 1);
        }
        return result;
    }
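One caveat with this approach is that a U+FFFD that was genuinely part of the input is stripped as well if it ends up last. A possible alternative, sketched here with hypothetical names, is to find the character boundary in the byte array before decoding: UTF-8 continuation bytes always have the bit pattern 10xxxxxx, so the cut can be moved back until the byte at the cut position starts a new character.

```csharp
using System;
using System.Text;

public static class Utf8ByteBoundary // hypothetical helper class
{
    // Truncates text to at most byteLimit UTF-8 bytes, moving the cut back
    // to the nearest character boundary instead of decoding a partial sequence.
    public static string TruncateAtUtf8Boundary(string text, int byteLimit)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(text);
        if (bytes.Length <= byteLimit)
        {
            return text; // Already short enough.
        }

        int cut = Math.Max(byteLimit, 0);
        // If bytes[cut] is a continuation byte (10xxxxxx), the cut falls inside
        // a multi-byte character, so back up to that character's lead byte.
        while (cut > 0 && (bytes[cut] & 0xC0) == 0x80)
        {
            cut--;
        }
        return Encoding.UTF8.GetString(bytes, 0, cut);
    }
}
```

Because a surrogate pair is encoded as a single 4-byte sequence, this also keeps pairs intact without any special handling.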

Source: https://habr.com/ru/post/1434043/

