Work with UTF8

Question

Working with std::string and UTF-8 seems like a pretty complicated problem, and I can't find a good explanation of the dos and don'ts.

How can I work correctly with UTF-8 in C++? It is pretty confusing.

I found boost::locale and I set the global locale:

std::locale::global(boost::locale::generator()(""));

However, after that, what do I need to watch out for, and where can I run into problems? Will writing to and reading from a file work properly? Will comparing strings work, etc.?

So far I know the following:

std::regex / boost::regex will not work; you need to convert to wide strings and use wregex.
boost::algorithm::to_upper will not work; you need to use boost::locale::to_upper.

What else do I need to know?

c++ string boost utf locale

ronag Jun 10 '12 at 10:14

1 answer

Matthieu M. · Answer 1 · 2012-06-10T10:42:23+0000

Welcome to the magnificent world of Unicode.

Sadly, wchar_t is implementation-defined, and on Windows it is typically only 16 bits wide, which is not enough to store a full code point for some Asian scripts (for example).
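To make the wchar_t point concrete, here is a minimal sketch in standard C++ only. The helper names count_code_points and count_utf16_units are made up for illustration, and both assume the input is already valid UTF-8. They show that a single code point above U+FFFF needs two 16-bit units, so one 16-bit wchar_t cannot hold it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: count Unicode code points in a UTF-8 string by
// skipping continuation bytes (those of the form 10xxxxxx).
// Assumes the input is valid UTF-8; no validation is done.
std::size_t count_code_points(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) ++n;  // count lead bytes and ASCII only
    }
    return n;
}

// Hypothetical helper: count the UTF-16 code units the same text needs.
// A 4-byte UTF-8 sequence (lead byte 11110xxx) encodes a code point above
// U+FFFF, which takes a surrogate pair (two units) in UTF-16.
std::size_t count_utf16_units(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) == 0x80) continue;   // continuation byte
        n += ((c & 0xF8) == 0xF0) ? 2 : 1;  // 4-byte lead => two units
    }
    return n;
}
```

For example, the CJK ideograph U+20BB7 is the four bytes "\xF0\xA0\xAE\xB7" in UTF-8: one code point, but two UTF-16 units.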
You can use plain comparisons for lookup, but to sort data and present it to your audience you need a full collation algorithm. Be aware, for example, that the order in a German dictionary differs from the order in a German phone book (and cry...).
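A small sketch of the difference, using only the standard library. The function name sort_bytewise is hypothetical. std::string's operator< compares raw bytes, which for valid UTF-8 amounts to code point order. That is deterministic and fine as a key order for lookup, but it is not what a human reader expects from a sorted list.

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Hypothetical helper: sort with std::string's operator<, i.e. by raw
// bytes. For valid UTF-8 this is code point order, not dictionary order.
std::vector<std::string> sort_bytewise(std::vector<std::string> words) {
    std::sort(words.begin(), words.end());
    return words;
}
```

Given "zebra", "Zebra", "äpfel" and "apple", this yields "Zebra", "apple", "zebra", "äpfel": uppercase sorts before lowercase, and "ä" (bytes 0xC3 0xA4) lands after "z". A locale-aware collator, such as the ICU one that Boost.Locale wraps, would be needed to put "äpfel" next to "apple".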
Generally speaking, I would advise against transforming strings yourself. The Boost.Locale algorithms should work in general, since they wrap ICU, but otherwise refrain from ad hoc operations.
If you split a string into several parts, do not split in the middle of words. It is too easy either to cut a character in two (even with code-point-aware algorithms, because of diacritics), or, while avoiding that, to split between two characters that belong together (because in some cultures certain combinations of adjacent characters are treated as a single one).
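The diacritics problem can be sketched as follows. The helper floor_to_code_point is a made-up name, and it assumes valid UTF-8. It moves a cut position back to the start of a code point, which is the most a "code-point-aware" split does, and that is still not enough.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: move a byte offset backwards until it no longer
// points at a UTF-8 continuation byte (10xxxxxx), so that a cut at the
// returned offset never lands inside an encoded code point.
// Offsets at or past the end are clamped to the string size.
std::size_t floor_to_code_point(const std::string& utf8, std::size_t pos) {
    if (pos >= utf8.size()) return utf8.size();
    while (pos > 0 &&
           (static_cast<unsigned char>(utf8[pos]) & 0xC0) == 0x80) {
        --pos;
    }
    return pos;
}
```

Take "e\xCC\x81x", which is "e" followed by the combining acute accent U+0301 and then "x". A cut requested at byte 2 is moved back to byte 1, a perfectly valid code point boundary, yet the first part is then a bare "e" and the accent starts the second part. Avoiding that takes real text boundary analysis (Boost.Locale has a boundary module built on ICU for this), which is why splitting only between words is the safer rule.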
