Iterating over a UTF-8 string in C++11

Question


I am trying to iterate through a UTF-8 string. The problem is that UTF-8 characters are of variable length, so I cannot simply iterate char by char; I need some kind of conversion. I am sure there is a function for this in modern C++, but I do not know what it is.

    #include <iostream>
    #include <string>

    int main() {
        std::string text = u8"řabcdě";
        std::cout << text << std::endl; // Prints fine
        std::cout << "First letter is: " << text.at(0) << text.at(1) << std::endl; // Again fine. So 'ř' is a 2 byte letter?
        for(auto it = text.begin(); it < text.end(); it++) {
            // Obviously wrong. Outputs only ascii part of the text (a, b, c, d) correctly
            std::cout << "Iterating: " << *it << std::endl;
        }
    }

Compiled with:

    clang++ -std=c++11 -stdlib=libc++ test.cpp

From what I have read, wchar_t and wstring should not be used.


c++11 unicode utf-8

Jan Šimek Sep 27 '14 at 11:19


1 answer

Jan Šimek · Accepted Answer · 2014-09-28T09:57:06+0000

As nm suggested, I used std::wstring_convert:

    #include <codecvt>
    #include <locale>
    #include <iostream>
    #include <string>

    int main() {
        std::u32string input = U"řabcdě";
        std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
        for(char32_t c : input) {
            std::cout << converter.to_bytes(c) << std::endl;
        }
    }
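The snippet above starts from a std::u32string literal, while the question starts from a UTF-8 std::string. A minimal sketch of the missing direction follows, assuming the same converter type is acceptable: from_bytes decodes the UTF-8 bytes into code points, and to_bytes re-encodes a single code point for printing. The helper names decode_utf8 and encode_utf8 are hypothetical, not part of the standard library. Note that std::wstring_convert and std::codecvt_utf8 are deprecated as of C++17, so this is best treated as a C++11/14 approach.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points.
// from_bytes throws std::range_error if the input cannot be converted.
std::u32string decode_utf8(const std::string& utf8) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(utf8);
}

// Hypothetical helper: encode one code point back into UTF-8 bytes.
std::string encode_utf8(char32_t code_point) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.to_bytes(code_point);
}

// Possible usage (needs <iostream>):
//   for (char32_t c : decode_utf8(text)) {
//       std::cout << "Iterating: " << encode_utf8(c) << std::endl;
//   }
```

Each loop iteration then sees one code point rather than one byte, which is what the loop in the question was after.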

Perhaps I should have clarified in the question that I wanted to know whether this is possible in C++11 without third-party libraries such as ICU or UTF8-CPP.
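For comparison, a different technique from the one in this answer is to step through the std::string by hand, using the lead byte of each sequence to decide how many bytes belong together. This needs no conversion facet and no third-party library. The sketch below assumes well-formed UTF-8 and does not validate continuation bytes; the helper names utf8_sequence_length and split_utf8 are hypothetical.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;                             // continuation byte or invalid lead
}

// Hypothetical helper: split a UTF-8 string into one substring per code point.
// Assumption: input is well-formed. A stray or truncated byte is emitted
// on its own instead of being rejected.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> pieces;
    std::size_t i = 0;
    while (i < text.size()) {
        std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(text[i]));
        if (len == 0 || i + len > text.size()) {
            len = 1;  // malformed input: fall back to a single byte
        }
        pieces.push_back(text.substr(i, len));
        i += len;
    }
    return pieces;
}
```

Both approaches iterate over code points, not user-perceived characters: a letter written as a base character plus a combining mark still arrives as two items.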
