My goal is to iterate over a Unicode text string character by character, but the code below iterates over code units instead of code points, even though I am using next32PostInc(), which is supposed to iterate over code points:
void iterate_codepoints(UCharCharacterIterator &it, std::string &str) {
    UChar32 c;
    while (it.hasNext()) {
        c = it.next32PostInc();
        str += c;
    }
}

void my_test() {
    const char testChars[] = "\xE6\x96\xAF";
The above code gives the correct result, which is the Chinese character 斯, but there are 3 iterations for this single character, not just 1. Can anyone explain what I'm doing wrong?
In a nutshell, I just want to loop over the characters one at a time, and I am happy to use any of the ICU iteration classes.
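For reference, the code unit versus code point distinction can be sketched without ICU. The bytes E6 96 AF are three UTF-8 code units but decode to the single code point U+65AF (斯). The helper `decode_utf8` below is a hypothetical, hand-rolled illustration and not part of ICU. It assumes well-formed UTF-8 and does no validation beyond checking that a sequence is not truncated.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not ICU API): decodes well-formed UTF-8 into code points.
// Assumption: the input is valid UTF-8; only truncation is checked.
std::vector<char32_t> decode_utf8(const std::string &bytes) {
    std::vector<char32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        char32_t cp;
        std::size_t len;
        // The lead byte determines how many code units form one code point.
        if (lead < 0x80)      { cp = lead;        len = 1; }
        else if (lead < 0xE0) { cp = lead & 0x1F; len = 2; }
        else if (lead < 0xF0) { cp = lead & 0x0F; len = 3; }
        else                  { cp = lead & 0x07; len = 4; }
        assert(i + len <= bytes.size());  // sequence must not be truncated
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len; ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(bytes[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Calling `decode_utf8("\xE6\x96\xAF")` yields one element, 0x65AF. The code that constructs the iterator is not shown in the question, so the following is only a guess: if each byte of testChars ends up as its own UChar, next32PostInc() would see three separate code points, and appending each one to a std::string would reproduce the original bytes. That would explain a correct-looking result after three iterations. Converting the bytes with UnicodeString::fromUTF8 before building the iterator may be worth trying.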
Still trying to solve this problem ...
I also noticed some unexpected behavior when using UnicodeString, as shown below. I am using VC++ 2013.
void test_02() {
    // UnicodeString us = "abc 123 ñ";     // results in good UTF-8: 61 62 63 20 31 32 33 20 c3 b1
    // UnicodeString us = "斯";            // results in bad UTF-8: 3f
    // UnicodeString us = "abc 123 ñ 斯";  // results in bad UTF-8: 61 62 63 20 31 32 33 20 c3 b1 20 3f (only the last part '3f' is corrupt)
    // UnicodeString us = "\xE6\x96\xAF";  // results in bad UTF-8: 00 55 24 04 c4 00 24
    // UnicodeString us = "\x61";          // results in good UTF-8: 61
    // UnicodeString us = "\x61\x62\x63";  // results in good UTF-8: 61 62 63
    // UnicodeString us = "\xC3\xB1";      // results in bad UTF-8: c3 83 c2 b1
    UnicodeString us = "ñ";                // results in good UTF-8: c3 b1
    std::string cs;
    us.toUTF8String(cs);
    std::cout << cs;                       // output result to file, i.e.: main >output.txt
}
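One possible reading of the "\xC3\xB1" result, offered as a sketch rather than a confirmed diagnosis: if the UnicodeString(const char *) constructor interprets the narrow string in the platform's default codepage rather than as UTF-8, and that codepage maps bytes 0xC3 and 0xB1 to U+00C3 and U+00B1 (as Latin-1 and Windows-1252 do), then each byte becomes its own character and is encoded separately on the way out. The hypothetical helper `latin1_to_utf8` below (not ICU API) imitates that double encoding.

```cpp
#include <string>

// Hypothetical helper (not ICU API): treats every input byte as a Latin-1
// code point (U+0000..U+00FF) and re-encodes it as UTF-8.
// Assumption: this imitates a default-codepage conversion on a Western
// Windows setup; the actual codepage in the question is not known.
std::string latin1_to_utf8(const std::string &bytes) {
    std::string out;
    for (unsigned char b : bytes) {
        if (b < 0x80) {
            // ASCII passes through unchanged.
            out += static_cast<char>(b);
        } else {
            // U+0080..U+00FF needs two UTF-8 bytes: 110000xx 10xxxxxx.
            out += static_cast<char>(0xC0 | (b >> 6));
            out += static_cast<char>(0x80 | (b & 0x3F));
        }
    }
    return out;
}
```

Feeding it the two bytes C3 B1 produces C3 83 C2 B1, which matches the "bad" output listed above, while the single byte F1 (ñ in Windows-1252) produces C3 B1. If the input bytes are known to be UTF-8, constructing the string with UnicodeString::fromUTF8 instead of the const char * constructor may avoid the codepage conversion.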