Substring of a std::string in UTF-8? C++11

I need to get a substring of the first N characters of a std::string that is presumably UTF-8 encoded. I have learned that .substr does not work... as... expected.

For reference: my strings probably look like this: mission:\n\n1億2千万匹
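The reason .substr surprises here is that std::string counts bytes, not characters. A minimal sketch of the effect, assuming the sample text above is stored as UTF-8; sample_text and splits_character are hypothetical helper names used only for illustration:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: the sample text "1億2千万匹", written as explicit
// UTF-8 bytes so the example does not depend on the source-file encoding.
std::string sample_text() {
    return "1\xE5\x84\x84" "2\xE5\x8D\x83\xE4\xB8\x87\xE5\x8C\xB9";
}

// Hypothetical helper: true if byte position pos falls inside a multi-byte
// character, i.e. the byte at pos is a continuation byte (10xxxxxx).
bool splits_character(const std::string& s, std::size_t pos) {
    return pos < s.size() &&
           (static_cast<unsigned char>(s[pos]) & 0xC0) == 0x80;
}
```

The sample holds 6 characters in 14 bytes, so sample_text().substr(0, 3) returns '1' plus only the first two of the three bytes of 億. splits_character(sample_text(), 3) reports that cutting at byte 3 lands in the middle of a character.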

3 answers

I found this code, and I'm just going to try it.

    #include <string>

    // Returns leng characters (code points) of str, starting at character index start.
    // std::size_t replaces unsigned int so that comparisons with npos behave correctly.
    std::string utf8_substr(const std::string& str, std::size_t start, std::size_t leng)
    {
        if (leng == 0) { return ""; }
        const std::size_t npos = std::string::npos;
        std::size_t i, q, min = npos, max = npos;
        const std::size_t ix = str.length();
        // i is the byte index, q is the character index.
        for (q = 0, i = 0; i < ix; i++, q++)
        {
            if (q == start) { min = i; }
            if (q <= start + leng || leng == npos) { max = i; }

            unsigned char c = static_cast<unsigned char>(str[i]);
            if (c <= 127) i += 0;                  // 0bbbbbbb: single byte
            else if ((c & 0xE0) == 0xC0) i += 1;   // 110bbbbb: 2-byte sequence
            else if ((c & 0xF0) == 0xE0) i += 2;   // 1110bbbb: 3-byte sequence
            else if ((c & 0xF8) == 0xF0) i += 3;   // 11110bbb: 4-byte sequence
            // 5- and 6-byte lead bytes (111110bb, 1111110b) are unnecessary in 4-byte UTF-8
            else return ""; // invalid utf8
        }
        if (q <= start + leng || leng == npos) { max = i; }
        if (min == npos || max == npos) { return ""; }
        // The second argument of substr is a byte count, not an end position.
        return str.substr(min, max - min);
    }

Update: This worked well for my current problem. I had to combine it with a get-length-of-UTF-8-encoded-std::string function.

As posted, this solution triggered some warnings from my compiler.
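The length function mentioned in the update is not shown in the thread. A minimal sketch of what such a helper could look like, assuming valid UTF-8 input; utf8_length is a hypothetical name, not the function the answer actually used:

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points by counting every byte that is
// not a continuation byte (10xxxxxx). It assumes valid UTF-8 and performs
// no validation.
std::size_t utf8_length(const std::string& str) {
    std::size_t count = 0;
    for (unsigned char c : str) {
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the sample string from the question this counts 6 characters, while .size() reports 14 bytes. It counts code points, so a character built from several code points (for example a base letter plus a combining accent) is counted more than once.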


You can use the Boost.Locale library to convert the UTF-8 string to a wide string (std::u32string in the code below), and then use the usual .substr() approach:

    #include <iostream>
    #include <string>
    #include <boost/locale.hpp>

    // Convert UTF-32 (one char32_t per code point) back to UTF-8.
    std::string ucs4_to_utf8(std::u32string const& in)
    {
        return boost::locale::conv::utf_to_utf<char>(in);
    }

    // Convert UTF-8 to UTF-32 so that substr counts code points.
    std::u32string utf8_to_ucs4(std::string const& in)
    {
        return boost::locale::conv::utf_to_utf<char32_t>(in);
    }

    int main()
    {
        std::string utf8 = u8"1億2千万匹";
        std::u32string part = utf8_to_ucs4(utf8).substr(0, 3);
        std::cout << ucs4_to_utf8(part) << std::endl; // prints : 1億2
        return 0;
    }

Based on this answer, I wrote my own UTF-8 substring function:

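    #include <string>

    // Copies the first SubStrLength characters (code points) of originalString
    // into csSubstring.
    void utf8substr(const std::string& originalString, int SubStrLength, std::string& csSubstring)
    {
        int len = 0;
        std::size_t byteIndex = 0;
        const std::size_t origSize = originalString.size();

        for (byteIndex = 0; byteIndex < origSize; byteIndex++)
        {
            // Every byte that is not a continuation byte (10xxxxxx) starts a new character.
            if ((originalString[byteIndex] & 0xc0) != 0x80)
                len += 1;

            // Stop at the first byte of the character after the last one wanted.
            // (Breaking on len >= SubStrLength would return one character too few.)
            if (len > SubStrLength)
                break;
        }

        csSubstring = originalString.substr(0, byteIndex);
    }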

Source: https://habr.com/ru/post/989577/

