UTF-8, sprintf, strlen, etc.

I am trying to understand how to handle basic UTF-8 operations in C++.

Let's say we have this scenario: A user enters a name, it is limited to 10 letters (characters in the user's language, not bytes), and it is saved.

This is easy to do in ASCII:

    // ASCII
    char * input;                   // user input
    char buf[11];                   // 10 letters + terminating zero
    snprintf(buf, 11, "%s", input);
    buf[10] = 0;
    int len = strlen(buf);          // returns 10 (correct)

Now, how do I do this in UTF-8? Suppose each letter takes up to 4 bytes (e.g. Chinese).

    // UTF-8
    char * input;                   // user input
    char buf[41];                   // 10 letters * 4 bytes + terminating zero
    snprintf(buf, 41, "%s", input); // ?? makes no sense, it limits by number of bytes, not letters
    int len = strlen(buf);          // returns the number of bytes, not letters (incorrect)

Can this be done with the standard sprintf / strlen? Are there replacements for these functions that work with UTF-8 (in PHP there were mb_-prefixed functions for this, IIRC)? If not, do I need to write them myself? Or should I approach the problem differently?

Note: I would prefer to avoid a wide-character solution ...

EDIT: Let's limit this to just the Basic Multilingual Plane.

+5
4 answers

strlen only counts the bytes in the input string, up to the terminating NUL.

You, on the other hand, seem to be interested in counting glyphs (what you called characters in the user's language).

The process is complicated by the fact that UTF-8 is a variable-length encoding (as is UTF-16, to a lesser extent), so a code point can be encoded using one to four bytes. There are also Unicode combining characters to consider.
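To make the byte-versus-code-point difference concrete, here is a minimal sketch that counts code points instead of bytes. The name utf8_code_points is made up for illustration, and the function assumes the input is already valid UTF-8. It relies on the fact that every continuation byte in UTF-8 has the bit pattern 10xxxxxx, so counting all other bytes counts code points.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a UTF-8 string.
// Assumes valid UTF-8 and performs no validation.
// Combining characters are counted separately, so this is a code point
// count, not a glyph count.
std::size_t utf8_code_points(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char c : s) {
        // Continuation bytes look like 10xxxxxx; every other byte
        // starts a new code point.
        if ((c & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For a string holding two 3-byte characters, std::string::size() (like strlen) reports 6, while this function reports 2.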

As far as I know, there is nothing like that in the standard C++ library. However, you may have better luck with third-party libraries such as ICU.

+1

I would prefer to avoid a wide-character solution ...

Wide characters alone are not enough: if a single glyph needs 4 bytes in UTF-8, it most likely lies outside the Basic Multilingual Plane, so it cannot be represented by a single 16-bit wchar_t (assuming wchar_t is 16 bits wide, as it is on Windows; on Linux and macOS it is usually 32 bits).
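As a rough illustration of where the boundaries fall, the sketch below computes how many UTF-8 bytes and how many 16-bit UTF-16 units a single code point needs. Both function names are invented for this example. Within the Basic Multilingual Plane (up to U+FFFF) a code point takes at most 3 bytes in UTF-8 and one 16-bit unit in UTF-16; anything above that takes 4 bytes and a surrogate pair.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes needed to encode one code point in UTF-8.
std::size_t utf8_bytes(char32_t cp) {
    if (cp < 0x80) return 1;     // ASCII
    if (cp < 0x800) return 2;
    if (cp < 0x10000) return 3;  // rest of the Basic Multilingual Plane
    return 4;                    // supplementary planes
}

// Hypothetical helper: 16-bit units needed to encode one code point in UTF-16.
// Code points above U+FFFF need a surrogate pair.
std::size_t utf16_units(char32_t cp) {
    return cp > 0xFFFF ? 2 : 1;
}
```

So if the input really is limited to the BMP, a buffer of 3 bytes per letter plus the terminator would be enough for the encoded text, although counting the letters still requires decoding.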

You will need to use a true Unicode library to convert the input to a sequence of Unicode characters in Normalization Form C (canonical composition, NFC) or its compatibility counterpart (NFKC) (*), depending on whether you want, for example, to count the ligature ﬀ (U+FB00) as one character or two. AFAIK, ICU is your best option.


(*) Unicode allows multiple representations of the same character, in particular the composed normal form (NFC) and the decomposed normal form (NFD). For example, the French character é can be represented in NFC as U+00E9 (LATIN SMALL LETTER E WITH ACUTE) or in NFD as U+0065 U+0301 (LATIN SMALL LETTER E followed by COMBINING ACUTE ACCENT), which is also displayed as é.
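The effect of the two forms on a naive count can be seen with a small decoder. This is only a sketch: decode_utf8 is a made-up name, it assumes well-formed UTF-8, and it does no normalization at all (that is the part a library such as ICU would handle).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into a sequence of code points.
// Sketch only: assumes well-formed input and performs no validation.
std::u32string decode_utf8(const std::string& s) {
    std::u32string out;
    std::size_t i = 0;
    while (i < s.size()) {
        unsigned char lead = static_cast<unsigned char>(s[i]);
        // Sequence length is determined by the lead byte.
        std::size_t len = lead < 0x80 ? 1 : lead < 0xE0 ? 2 : lead < 0xF0 ? 3 : 4;
        // Payload bits of the lead byte: 7, 5, 4 or 3 depending on length.
        char32_t cp = len == 1 ? lead : lead & (0xFF >> (len + 1));
        // Each continuation byte contributes its low 6 bits.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3F);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}
```

Decoding "\xC3\xA9" (é in NFC) yields the single code point U+00E9, while decoding "e\xCC\x81" (é in NFD) yields two code points, U+0065 and U+0301, even though both render as the same letter.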

References and other examples of Unicode equivalence

+1

std::strlen indeed only counts single-byte characters. To get the length of a NUL-terminated wide (Unicode) string, you can use std::wcslen.

Example:

    #include <iostream>
    #include <cwchar>
    #include <clocale>

    int main()
    {
        const wchar_t* str = L"爆ぜろリアル！弾けろシナプス！パニッシュメントディス、ワールド！";

        std::setlocale(LC_ALL, "en_US.utf8");
        std::wcout.imbue(std::locale("en_US.utf8"));
        std::wcout << "The length of \"" << str << "\" is " << std::wcslen(str) << '\n';
    }
0

If you do not want to count UTF-8 characters yourself, you can temporarily convert the input to a wide string in order to truncate it. You do not need to keep the intermediate values:

    #include <iostream>
    #include <codecvt>
    #include <string>
    #include <locale>

    // Truncates a UTF-8 string to at most len wide characters.
    // Note: std::wstring_convert and <codecvt> are deprecated since C++17.
    std::string cutString(const std::string& in, size_t len) {
        std::wstring_convert<std::codecvt_utf8<wchar_t>> cvt;
        auto wstring = cvt.from_bytes(in);
        if (len < wstring.length()) {
            wstring = wstring.substr(0, len);
            return cvt.to_bytes(wstring);
        }
        return in;
    }

    int main() {
        std::string test = "你好世界這是演示樣本";
        std::string res = cutString(test, 5);
        std::cout << test << '\n' << res << '\n';
        return 0;
    }

    /****************
    Output
    $ ./test
    你好世界這是演示樣本
    你好世界這
    */
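Since std::wstring_convert is deprecated as of C++17, the same truncation can also be sketched directly on the bytes, without any conversion. This is an alternative to the approach above, not a drop-in replacement: utf8_truncate is a hypothetical name, the input is assumed to be valid UTF-8, and the limit is expressed in code points, so a base letter may still be separated from a combining mark that follows it.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the longest prefix of `in` that holds at
// most `max_cp` code points. Assumes valid UTF-8; no validation is done.
std::string utf8_truncate(const std::string& in, std::size_t max_cp) {
    std::size_t seen = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        unsigned char c = static_cast<unsigned char>(in[i]);
        if ((c & 0xC0) != 0x80) {        // start of a new code point
            if (seen == max_cp) {
                return in.substr(0, i);  // cut before the next code point
            }
            ++seen;
        }
    }
    return in;  // already short enough
}
```

Applied to the original scenario, utf8_truncate(input, 10) would keep at most 10 code points however many bytes each one occupies, and the result can be stored in a std::string instead of a fixed-size buffer.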
0

Source: https://habr.com/ru/post/1271963/

