ICU code point iteration

My goal is to iterate over a Unicode string one character (code point) at a time, but the code below iterates over code units instead of code points, even though I use next32PostInc(), which should iterate over code points:

    void iterate_codepoints(UCharCharacterIterator &it, std::string &str) {
        UChar32 c;
        while (it.hasNext()) {
            c = it.next32PostInc();
            str += c;
        }
    }

    void my_test() {
        const char testChars[] = "\xE6\x96\xAF"; // Chinese character 斯 in UTF-8
        UnicodeString testString(testChars, "");
        const UChar *testText = testString.getTerminatedBuffer();

        UCharCharacterIterator iter(testText, u_strlen(testText));

        std::string str;
        iterate_codepoints(iter, str);
        std::cout << str; // outputs 斯 in UTF-8 format
    }

    int main() {
        my_test();
        return 0;
    }

The above code gives the correct result, which is the Chinese character 斯, but there are 3 iterations for this single character, not just 1. Can anyone explain what I'm doing wrong?

In a nutshell, I just want to loop over the characters one at a time, and I am happy to use any of the ICU iterator classes.

Still trying to solve this problem ...

I also noticed some odd behavior when using UnicodeString, as shown below. I am using VC++ 2013.

    void test_02() {
        // UnicodeString us = "abc 123 ñ";     // results in good UTF-8: 61 62 63 20 31 32 33 20 c3 b1
        // UnicodeString us = "斯";             // results in bad UTF-8: 3f
        // UnicodeString us = "abc 123 ñ 斯";   // results in bad UTF-8: 61 62 63 20 31 32 33 20 c3 b1 20 3f (only the last part '3f' is corrupt)
        // UnicodeString us = "\xE6\x96\xAF";  // results in bad UTF-8: 00 55 24 04 c4 00 24
        // UnicodeString us = "\x61";          // results in good UTF-8: 61
        // UnicodeString us = "\x61\x62\x63";  // results in good UTF-8: 61 62 63
        // UnicodeString us = "\xC3\xB1";      // results in bad UTF-8: c3 83 c2 b1
        UnicodeString us = "ñ";                // results in good UTF-8: c3 b1
        std::string cs;
        us.toUTF8String(cs);
        std::cout << cs; // output result to file, ie: main >output.txt
    }


1 answer

Since your source data is UTF-8, you need to tell UnicodeString so. Its constructor has a codepage parameter for this purpose, but you set it to an empty string:

 UnicodeString testString(testChars, ""); 

This tells UnicodeString to perform an invariant conversion, which is not what you want. As a result, you get 3 code points (U+00E6, U+0096, U+00AF) instead of 1 code point (U+65AF), so your loop runs three times.
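To make the difference concrete, here is a minimal sketch that uses only the standard library, not ICU. The helpers widen_bytes and decode_utf8 are hypothetical names made up for illustration: the first only mimics the effect described above (each byte ends up as its own code point), and the second assumes well-formed UTF-8 and does no validation.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: treat every byte as a separate code point.
// This only imitates the observed result; it is not how ICU is implemented.
std::vector<char32_t> widen_bytes(const std::string &bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(b);
    }
    return out;
}

// Hypothetical helper: decode UTF-8 into code points.
// Assumes the input is well-formed; malformed input is not detected.
std::vector<char32_t> decode_utf8(const std::string &bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        int extra; // number of continuation bytes that follow the lead byte
        if (lead < 0x80)                { cp = lead;        extra = 0; }
        else if ((lead & 0xE0) == 0xC0) { cp = lead & 0x1F; extra = 1; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; extra = 2; }
        else                            { cp = lead & 0x07; extra = 3; }
        ++i;
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            // Each continuation byte contributes its low 6 bits.
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For the bytes "\xE6\x96\xAF", widen_bytes returns three elements (0xE6, 0x96, 0xAF) while decode_utf8 returns the single element 0x65AF, which matches the three loop iterations versus the expected one.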

You need to change the constructor call so that UnicodeString knows the data is UTF-8, for example:

 UnicodeString testString(testChars, "utf-8"); 
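One thing to watch for after this change: the question's loop does `str += c` with a UChar32, which appends a single narrowed char. That happened to reproduce the original bytes while each byte was its own code point, but it would presumably keep only the low byte of U+65AF once the string is decoded correctly. Within ICU, the toUTF8String call from the question's second snippet handles the encoding. The sketch below shows the idea with the standard library only; append_utf8 is a hypothetical helper, not an ICU function, and it assumes a valid code point (no surrogates, at most U+10FFFF).

```cpp
#include <string>

// Hypothetical helper (not part of ICU): append one code point as UTF-8.
// Assumes cp is a valid Unicode scalar value; no error handling.
void append_utf8(std::string &out, char32_t cp) {
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Calling append_utf8(str, 0x65AF) on an empty string yields the three bytes E6 96 AF, so replacing `str += c` with such a call (or collecting the text with toUTF8String) keeps the output valid UTF-8 when the loop runs once per code point.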

Source: https://habr.com/ru/post/1204998/

