Iterating over a UTF-8 string in C++11

I am trying to iterate through a UTF-8 string. The problem is that UTF-8 characters have variable length, so I cannot simply iterate char by char; I need some kind of conversion. I am sure there is a function for this in modern C++, but I do not know what it is.

#include <iostream>
#include <string>

int main() {
    std::string text = u8"řabcdě";
    std::cout << text << std::endl; // Prints fine
    std::cout << "First letter is: " << text.at(0) << text.at(1) << std::endl; // Again fine. So 'ř' is a 2 byte letter?
    for(auto it = text.begin(); it < text.end(); it++) {
        // Obviously wrong. Outputs only ascii part of the text (a, b, c, d) correctly
        std::cout << "Iterating: " << *it << std::endl;
    }
}

Compiled with clang++ -std=c++11 -stdlib=libc++ test.cpp

From what I have read, wchar_t and wstring should not be used.

1 answer

As nm suggested, I used std::wstring_convert:

#include <codecvt>
#include <locale>
#include <iostream>
#include <string>

int main() {
    std::u32string input = U"řabcdě";
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    for(char32_t c : input) {
        std::cout << converter.to_bytes(c) << std::endl;
    }
}
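The snippet above starts from a std::u32string literal, while the question starts from UTF-8 bytes in a std::string. A minimal sketch of the missing step, assuming the same converter is acceptable: from_bytes decodes the UTF-8 bytes into code points, which can then be iterated one by one. The helper name decode_utf8 is hypothetical, not part of the standard library. Note that std::wstring_convert and <codecvt> are deprecated since C++17, so this is only a sketch for C++11/14 code.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper: decode a UTF-8 encoded std::string into code points.
// std::wstring_convert::from_bytes throws std::range_error on malformed
// input, because no error string was passed to the converter's constructor.
std::u32string decode_utf8(const std::string& bytes) {
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
    return converter.from_bytes(bytes);
}

// Possible usage, mirroring the loop from the question:
//
//   std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
//   for (char32_t c : decode_utf8(text)) {
//       std::cout << "Iterating: " << converter.to_bytes(c) << std::endl;
//   }
```

Each char32_t is converted back with to_bytes only for printing; comparisons or lookups can work on the code points directly.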

Perhaps I should have clarified in the question that I would like to know whether this is possible in C++11 without using any third-party libraries such as ICU or UTF8-CPP.
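For completeness, the string can also be walked without any conversion facility, because the lead byte of each UTF-8 sequence encodes how many bytes the sequence occupies. The following is a rough sketch of that idea, not a validating decoder: it assumes the input is well-formed UTF-8, does not check continuation bytes or overlong forms, and the names utf8_sequence_length and split_utf8 are made up for illustration.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Number of bytes in the UTF-8 sequence introduced by `lead`, or 0 if
// `lead` cannot start a sequence (for example a continuation byte).
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 0;
}

// Split a UTF-8 string into one substring per code point.
// Assumption: well-formed input. Splitting stops at the first byte that
// cannot start a sequence, or when a sequence runs past the end.
std::vector<std::string> split_utf8(const std::string& text) {
    std::vector<std::string> pieces;
    std::size_t i = 0;
    while (i < text.size()) {
        const std::size_t len =
            utf8_sequence_length(static_cast<unsigned char>(text[i]));
        if (len == 0 || i + len > text.size()) break;  // malformed: give up
        pieces.push_back(text.substr(i, len));
        i += len;
    }
    return pieces;
}
```

Iterating over the result of split_utf8(text) and printing each piece yields one output per code point. The cast to unsigned char matters because plain char may be signed, in which case bytes of 0x80 and above would compare as negative.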


Source: https://habr.com/ru/post/1203537/

