UTF-8 string sorting?

My std::strings are encoded in UTF-8, so the std::string operator< doesn't cut it. How can I compare two UTF-8 encoded std::strings?

Where it doesn't cut it is with accents: é comes after z, which it should not.

thanks

+4
4 answers

If you don't want a lexicographic ordering (which is what sorting the UTF-8 encoded strings byte by byte gives you), you will need to decode your UTF-8 encoded strings to UCS-2 or UCS-4 as appropriate and apply a suitable comparison function of your choice.

To reiterate the point: the UTF-8 encoding is cleverly designed so that if you sort by the numeric value of each 8-bit encoded byte, you get the same result as if you had first decoded the strings into Unicode and compared the numeric values of each code point.

Update: Your updated question indicates that you need a more sophisticated comparison function than just lexicographic sorting. You will need to decode the UTF-8 strings and compare the decoded characters.
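To make both points concrete, here is a minimal sketch. It assumes well-formed UTF-8 input and does no error handling. The names `decode_utf8`, `orders_agree`, `base_letter` and `accent_aware_less` are made up for illustration, and the tiny accent table is only a stand-in for a real collation table. It shows that byte order and code point order agree, and how a custom comparison can be applied to the decoded characters.

```cpp
#include <cstddef>
#include <string>

// Decode UTF-8 into UTF-32 code points.
// Assumption: the input is well-formed; no error handling is attempted.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    for (std::size_t i = 0; i < s.size();) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        // The high bits of the lead byte give the sequence length.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes six more bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        out.push_back(cp);
        i += len;
    }
    return out;
}

// True when byte-wise comparison and code point comparison agree on a < b.
bool orders_agree(const std::string& a, const std::string& b) {
    return (a < b) == (decode_utf8(a) < decode_utf8(b));
}

// Hypothetical primary key: strips the accent from a few Latin-1 letters.
// A real collation table covers far more than this.
char32_t base_letter(char32_t c) {
    switch (c) {
        case 0xE8: case 0xE9: case 0xEA: case 0xEB: return U'e';  // accented e
        case 0xE0: case 0xE1: case 0xE2: case 0xE4: return U'a';  // accented a
        default: return c;
    }
}

// Compare by base letters first; break ties by raw code point so the result
// is still a strict weak ordering usable with std::sort.
bool accent_aware_less(const std::string& a, const std::string& b) {
    std::u32string da = decode_utf8(a), db = decode_utf8(b);
    std::u32string ka = da, kb = db;
    for (char32_t& c : ka) c = base_letter(c);
    for (char32_t& c : kb) c = base_letter(c);
    if (ka != kb) return ka < kb;
    return da < db;
}
```

With this, `std::sort(v.begin(), v.end(), accent_aware_less)` would place é between e and f, whereas the plain `operator<` places it after z. For anything beyond a toy table, a locale-aware or ICU-based comparison is the more realistic choice.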

+4

The standard has std::locale for locale-specific things such as collation (sorting). If the environment contains LC_COLLATE=en_US.utf8 or similar, this program will sort lines as desired.

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <locale>
    #include <string>
    #include <vector>

    // std::binary_function was removed in C++17, so the functor no longer
    // derives from it; std::sort does not need it.
    class collate_in {
        std::locale loc;  // keeps the facet referenced below alive
        const std::collate<char> &coll;

    public:
        explicit collate_in(const std::locale &l)
            : loc(l), coll(std::use_facet<std::collate<char> >(loc)) {}

        bool operator()(const std::string &a, const std::string &b) const {
            // std::collate::compare() takes C-style string (begin, end)s and
            // returns values like strcmp or strcoll. Compare to 0 for results
            // expected for a less<>-style comparator.
            return coll.compare(a.c_str(), a.c_str() + a.size(),
                                b.c_str(), b.c_str() + b.size()) < 0;
        }
    };

    int main() {
        std::vector<std::string> v;
        std::copy(std::istream_iterator<std::string>(std::cin),
                  std::istream_iterator<std::string>(), std::back_inserter(v));
        // std::locale("") is the locale from the environment. One could also
        // std::locale::global(std::locale("")) to set up this program's global
        // first, and then use locale() to get the global locale, or choose a
        // specific locale instead of using the environment's.
        std::sort(v.begin(), v.end(), collate_in(std::locale("")));
        std::copy(v.begin(), v.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));
        return 0;
    }

    $ cat > file
    f
    é
    e
    d
    ^D
    $ LC_COLLATE=C ./a.out < file
    d
    e
    f
    é
    $ LC_COLLATE=en_US.utf8 ./a.out < file
    d
    e
    é
    f

It has been brought to my attention that std::locale::operator()(a, b) exists, which makes the std::collate<>::compare(a, b) < 0 wrapper I wrote above unnecessary.

    #include <algorithm>
    #include <iostream>
    #include <iterator>
    #include <locale>
    #include <string>
    #include <vector>

    int main() {
        std::vector<std::string> v;
        std::copy(std::istream_iterator<std::string>(std::cin),
                  std::istream_iterator<std::string>(), std::back_inserter(v));
        // A std::locale object is itself a valid comparator for std::sort.
        std::sort(v.begin(), v.end(), std::locale(""));
        std::copy(v.begin(), v.end(),
                  std::ostream_iterator<std::string>(std::cout, "\n"));
        return 0;
    }
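
One practical caveat is that constructing std::locale from a name throws std::runtime_error when that locale is not installed. The sketch below wraps the sort with a fallback to the classic "C" locale. The helper name `sort_with_locale` is made up for illustration, and locale names such as "en_US.utf8" are platform-dependent, so treat them as assumptions about your system.

```cpp
#include <algorithm>
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Sort with the collation rules of the named locale. Returns false, and
// falls back to the classic "C" locale, when the named locale is unavailable.
bool sort_with_locale(std::vector<std::string>& v, const char* name) {
    try {
        // std::locale(name) throws std::runtime_error for unknown names.
        std::sort(v.begin(), v.end(), std::locale(name));
        return true;
    } catch (const std::runtime_error&) {
        // The classic locale always exists; it compares bytes, like strcmp.
        std::sort(v.begin(), v.end(), std::locale::classic());
        return false;
    }
}
```

Calling `sort_with_locale(v, "en_US.utf8")` and checking the return value tells you whether the result is really in the collation order you asked for or only in byte order.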
+6

The encoding (UTF-8, UTF-16, etc.) is not the problem. What matters is whether the container itself treats the string as a Unicode string or as an 8-bit (ASCII or Latin-1) string.

I found the question "Is there an STL and UTF-8 friendly C++ Wrapper for ICU, or other powerful Unicode library", which might help you.

0

One option is to use ICU collators ( http://userguide.icu-project.org/collation/api ), which provide a properly internationalized comparison method that can then be used for sorting.

Chromium has a small wrapper around it that should be easy to copy and paste or reuse:

https://code.google.com/p/chromium/codesearch#chromium/src/base/i18n/string_compare.cc&sq=package:chromium&type=cs

0

Source: https://habr.com/ru/post/1334452/

