ICU code point iteration

Question


My goal is to iterate over Unicode text character by character (that is, by code point), but the code below iterates over code units instead, even though I use next32PostInc(), which should iterate over code points:

    void iterate_codepoints(UCharCharacterIterator &it, std::string &str)
    {
        UChar32 c;
        while (it.hasNext()) {
            c = it.next32PostInc();
            str += c;
        }
    }

    void my_test()
    {
        const char testChars[] = "\xE6\x96\xAF"; // Chinese character 斯 in UTF-8
        UnicodeString testString(testChars, "");
        const UChar *testText = testString.getTerminatedBuffer();
        UCharCharacterIterator iter(testText, u_strlen(testText));
        std::string str;
        iterate_codepoints(iter, str);
        std::cout << str; // outputs 斯 in UTF-8 format
    }

    int main()
    {
        my_test();
        return 0;
    }

The above code gives the correct result, which is the Chinese character 斯, but there are 3 iterations for this single character, not just 1. Can anyone explain what I'm doing wrong?

In a nutshell, I just want to loop over the characters one at a time, and I am happy to use any of ICU's iteration classes.

Still trying to solve this problem ...

I also noticed some odd behavior when using UnicodeString, as shown below. I am using VC++ 2013.

    void test_02()
    {
        // UnicodeString us = "abc 123 ñ";     // results in good UTF-8: 61 62 63 20 31 32 33 20 c3 b1
        // UnicodeString us = "斯";            // results in bad UTF-8: 3f
        // UnicodeString us = "abc 123 ñ 斯";  // results in bad UTF-8: 61 62 63 20 31 32 33 20 c3 b1 20 3f (only the last part '3f' is corrupt)
        // UnicodeString us = "\xE6\x96\xAF";  // results in bad UTF-8: 00 55 24 04 c4 00 24
        // UnicodeString us = "\x61";          // results in good UTF-8: 61
        // UnicodeString us = "\x61\x62\x63";  // results in good UTF-8: 61 62 63
        // UnicodeString us = "\xC3\xB1";      // results in bad UTF-8: c3 83 c2 b1
        UnicodeString us = "ñ";                // results in good UTF-8: c3 b1
        std::string cs;
        us.toUTF8String(cs);
        std::cout << cs; // output result to file, ie: main >output.txt
    }



c++ icu

Caroline beltran Oct 19 '14 at 2:53


1 answer

Remy lebeau · Accepted Answer · 2014-10-20T23:05:43+0000

Since your source data is UTF-8, you need to tell UnicodeString that. Its constructor has a codepage parameter for this purpose, but you set it to an empty string:

 UnicodeString testString(testChars, "");

This tells UnicodeString to perform an invariant conversion, which is not what you want. As a result, you get 3 code points (U+00E6 U+0096 U+00AF) instead of 1 code point (U+65AF), so your loop iterates three times.
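To see where the three code points come from, here is a minimal sketch in plain standard C++ (no ICU involved). The helper names bytes_as_code_points and decode_utf8 are made up for this illustration: the first only mimics the effect described above by giving every byte its own code point, and the second is a bare-bones UTF-8 decoder that assumes well-formed input. Neither is ICU's actual conversion code.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: treats every byte as its own code point.
// This mimics the effect described above, not ICU's implementation.
std::vector<char32_t> bytes_as_code_points(const std::string &bytes) {
    std::vector<char32_t> out;
    for (unsigned char b : bytes) {
        out.push_back(b);
    }
    return out;
}

// Hypothetical helper: minimal UTF-8 decoder.
// Assumes well-formed input and performs no validation.
std::vector<char32_t> decode_utf8(const std::string &bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp = 0;
        int extra = 0; // number of continuation bytes that follow the lead byte
        if (lead < 0x80) {
            cp = lead;
        } else if ((lead & 0xE0) == 0xC0) {
            cp = lead & 0x1F;
            extra = 1;
        } else if ((lead & 0xF0) == 0xE0) {
            cp = lead & 0x0F;
            extra = 2;
        } else {
            cp = lead & 0x07;
            extra = 3;
        }
        ++i;
        // Each continuation byte contributes its low 6 bits.
        for (int k = 0; k < extra && i < bytes.size(); ++k, ++i) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
        }
        out.push_back(cp);
    }
    return out;
}
```

For the bytes "\xE6\x96\xAF", bytes_as_code_points yields three values (0xE6, 0x96, 0xAF), while decode_utf8 yields the single value 0x65AF.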

You need to change the constructor call so that UnicodeString knows that the data is UTF-8, for example:

 UnicodeString testString(testChars, "utf-8");
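One side note that goes beyond the original answer: in the question's loop, str += c appends a UChar32 to a std::string, which narrows the value to a single char. That happens to reproduce the input bytes while every code point is below U+0100, but once the loop sees U+65AF it would append only one byte. The toUTF8String method already used in the question's second snippet is one way to get UTF-8 back out of a UnicodeString. The sketch below shows the same idea by hand in plain standard C++; append_utf8 is a hypothetical helper written for this illustration, not an ICU function, and it assumes the input is a valid code point.

```cpp
#include <string>

// Hypothetical helper: appends one code point to a std::string as UTF-8.
// Assumes cp is a valid Unicode scalar value (no surrogate or range checks).
void append_utf8(std::string &out, char32_t cp) {
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}
```

Calling append_utf8(str, 0x65AF) on an empty string produces the three bytes E6 96 AF, which is the UTF-8 form of 斯.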
