Working with UTF-8

Working with std::string and UTF-8 seems like a pretty complicated problem, and I can't find a good explanation of the dos and don'ts.

How do I work with UTF-8 correctly in C++? This is pretty confusing.

I found boost::locale and I set the global locale:

std::locale::global(boost::locale::generator()(""));

After that, though, what do I need to watch out for, and where can I run into problems? Will reading from and writing to a file work properly? What about comparing strings, and so on?

So far I know the following:

  • std::regex / boost::regex will not work; you need to convert to wide strings and use wregex.
  • boost::algorithm::to_upper will not work; you need to use boost::locale::to_upper.
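
To make these two points concrete, here is a minimal sketch that uses only the standard library, not Boost. The helper names matches_single_dot and ascii_to_upper are made up for this illustration. It assumes the input is UTF-8 held in a std::string, and it only shows the symptoms of byte-oriented processing, not a fix.

```cpp
#include <regex>
#include <string>

// Returns true if the regex "." (any single character) matches the whole
// string. With std::string, "character" means one char, i.e. one byte.
bool matches_single_dot(const std::string& s) {
    static const std::regex any_one(".");
    return std::regex_match(s, any_one);
}

// Uppercases only the ASCII letters a-z; every other byte is copied as-is.
// This mimics what a byte-oriented to_upper can safely do with UTF-8:
// multi-byte characters such as "é" are left untouched.
std::string ascii_to_upper(std::string s) {
    for (char& c : s) {
        if (c >= 'a' && c <= 'z') {
            c = static_cast<char>(c - 'a' + 'A');
        }
    }
    return s;
}
```

For example, matches_single_dot("a") is true, but matches_single_dot("\xC3\xA9") is false, because "é" is two bytes in UTF-8. Likewise, ascii_to_upper("caf\xC3\xA9") uppercases "caf" and leaves the bytes of "é" unchanged.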

What else do I need to know?

1 answer

Welcome to the magnificent world of Unicode.

  • Unfortunately, wchar_t is implementation-defined, and on Windows it is typically 16 bits wide, which is not enough to hold every code point in a single unit (some Asian scripts, for example, use code points that do not fit).
  • You can use plain comparisons for lookups, but to sort data and present it to your audience you need a full collation algorithm. Be aware, for example, that German dictionary order differs from German phone book order (and weep...).
  • Generally speaking, I would advise against transforming strings yourself. The Boost.Locale algorithms should work, since they wrap ICU, but otherwise refrain from ad-hoc operations.
  • If you cut a string into several parts, do not cut in the middle of a word. It is too easy to split a character in two (even with code-point-aware algorithms, because of diacritics), or, even if you avoid that, to split between two characters that belong together (in some cultures certain combinations of adjacent characters are treated as a single one).
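
The last point can be sketched with the standard library alone. The helpers below (is_continuation, count_code_points, truncate_at_code_point) are hypothetical names for this illustration. They assume valid UTF-8 and do no validation. The sketch shows that cutting on a code point boundary avoids broken bytes but can still separate a base letter from its combining diacritic.

```cpp
#include <cstddef>
#include <string>

// True if b is a UTF-8 continuation byte (bit pattern 10xxxxxx).
bool is_continuation(unsigned char b) {
    return (b & 0xC0) == 0x80;
}

// Counts code points in s by counting every byte that starts a sequence.
// Assumes s is valid UTF-8.
std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        if (!is_continuation(b)) {
            ++n;
        }
    }
    return n;
}

// Returns the longest prefix of s that is at most max_bytes long and does
// not end in the middle of a code point. Assumes s is valid UTF-8.
std::string truncate_at_code_point(const std::string& s, std::size_t max_bytes) {
    if (s.size() <= max_bytes) {
        return s;
    }
    std::size_t cut = max_bytes;
    // Back up while the first dropped byte is a continuation byte, so the
    // cut lands on the start of a code point.
    while (cut > 0 && is_continuation(static_cast<unsigned char>(s[cut]))) {
        --cut;
    }
    return s.substr(0, cut);
}
```

Take "é" written as "e" followed by U+0301 (combining acute accent), which is "e\xCC\x81" in UTF-8. count_code_points reports 2 code points for what the reader sees as one character, and truncate_at_code_point("e\xCC\x81", 2) returns plain "e": the bytes are intact, but the accent is gone. Splitting on user-perceived characters or word boundaries needs a Unicode-aware library such as ICU, which Boost.Locale wraps.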

Source: https://habr.com/ru/post/1468851/

