std::string and UTF-8 encoded Unicode

If I understand correctly, you can use both string and wstring to store UTF-8 text.

  • With char, ASCII characters take one byte, some Chinese characters take 3 or 4, etc. This means that str[3] does not necessarily refer to the 4th character.

  • With wchar_t it is the same, but the minimum number of bytes used for each character is always 2 (instead of 1 for char), and a character 3 or 4 bytes wide will take 2 wchar_t.

Right?

So what if I want to use string::find_first_of() or string::compare(), etc. with such an unusual encoding? Will this work? Does the string class handle the fact that characters have variable size? Or should I use strings only as dummy byte arrays with no string features, in which case I would rather use a wchar_t[] buffer.

If std::string does not cope with this, the second question is: are there libraries providing string classes that can handle UTF-8 encoding, so that str[3] actually refers to the fourth character (which would be a byte array of length 1 to 4)?

+6
3 answers

You are talking about Unicode. Every Unicode code point fits in 32 bits (this fixed-width form is UTF-32). However, since that would waste memory, there are more compact encodings. UTF-8 is one such encoding. It uses bytes as units and maps Unicode characters to 1, 2, 3, or 4 bytes. UTF-16 is another; it uses 16-bit words as units and maps Unicode characters to 1 or 2 words (2 or 4 bytes). You can use either encoding with both string and wstring. UTF-8 tends to be more compact for English text and numbers.
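To make the 1-to-4-byte mapping concrete, here is a minimal sketch of a UTF-8 encoder. The names append_utf8 and utf16_units are made up for this illustration and do not come from any library, and the code assumes it is given a valid code point (no surrogates, nothing above 0x10FFFF).

```cpp
#include <cstddef>
#include <string>

// Appends the UTF-8 encoding of one Unicode code point to `out`.
// Assumption: `cp` is a valid scalar value; no validation is performed.
void append_utf8(std::string& out, char32_t cp) {
    if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {             // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                               // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Number of 16-bit units UTF-16 needs for the same code point:
// one unit inside the Basic Multilingual Plane, a surrogate pair above it.
std::size_t utf16_units(char32_t cp) {
    return cp < 0x10000 ? 1 : 2;
}
```

With this sketch, 'A' (U+0041) becomes one byte, U+00E9 two, U+4E2D three, and U+1F600 four, while UTF-16 needs two units only for the last one.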

Some things will work regardless of the encoding and type used (compare, for example). However, all functions that need to understand a single character will break. That is, the fifth character is not always the fifth entry in the underlying array. It may look like it works with certain examples, but it will eventually break.

string::compare will work, but do not expect to get alphabetical order; that depends on the language. string::find_first_of will work for some inputs, but not for all. Long strings are likely to work just because they are long, while shorter strings can get confused by character alignment, which produces bugs that are very hard to find.
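A small sketch of the find_first_of problem: the standard function treats its argument as a set of single bytes, so two different characters that share a lead byte can produce a false match. The helper names utf8_sequence_length and find_first_char_of below are hypothetical, written only for this illustration, and they assume well-formed UTF-8 input.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Assumption: well-formed input; an unexpected byte is treated as length 1.
std::size_t utf8_sequence_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Hypothetical UTF-8-aware counterpart of find_first_of: returns the byte
// offset of the first whole character of `text` that also occurs in `set`,
// or std::string::npos if there is none.
std::size_t find_first_char_of(const std::string& text, const std::string& set) {
    for (std::size_t i = 0; i < text.size(); ) {
        std::size_t len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
        for (std::size_t j = 0; j < set.size(); ) {
            std::size_t set_len = utf8_sequence_length(static_cast<unsigned char>(set[j]));
            // Compare whole characters, never individual bytes.
            if (set_len == len && text.compare(i, len, set, j, set_len) == 0) {
                return i;
            }
            j += set_len;
        }
        i += len;
    }
    return std::string::npos;
}
```

For example, with text "a\xC3\xA8" (the letter a followed by U+00E8) and set "\xC3\xA9" (U+00E9), text.find_first_of(set) reports a match at offset 1 because both characters start with the byte 0xC3, while find_first_char_of correctly finds nothing. Plain string::find, which matches a whole byte sequence, does not have this problem.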

It is best to find a library that handles this for you, and to ignore the underlying type (unless you have a good reason to choose one or the other).
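To give a rough idea of what such a library has to do for character indexing, here is a minimal sketch. The functions utf8_char_length, utf8_length and utf8_char_at are invented for this example and are not the API of any real library; they assume well-formed UTF-8 and do no validation.

```cpp
#include <cstddef>
#include <string>

// Length in bytes of the UTF-8 sequence that starts with `lead`.
// Assumption: well-formed input; an unexpected byte is treated as length 1.
std::size_t utf8_char_length(unsigned char lead) {
    if (lead < 0x80) return 1;
    if ((lead & 0xE0) == 0xC0) return 2;
    if ((lead & 0xF0) == 0xE0) return 3;
    if ((lead & 0xF8) == 0xF0) return 4;
    return 1;
}

// Number of code points: count every byte that is not a continuation byte (10xxxxxx).
std::size_t utf8_length(const std::string& s) {
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) ++count;
    }
    return count;
}

// Bytes of the n-th code point (zero-based), or an empty string if n is out of range.
std::string utf8_char_at(const std::string& s, std::size_t n) {
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t len = utf8_char_length(static_cast<unsigned char>(s[i]));
        if (n == 0) return s.substr(i, len);
        --n;
        i += len;
    }
    return std::string();
}
```

Note that this kind of indexing walks the string from the start, so it costs time proportional to the position, unlike str[3] on the raw bytes. A real library also has to deal with invalid input, which this sketch ignores.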

+5

You cannot handle Unicode with std::string or any other tool from the standard library. Use an external library, for example: http://utfcpp.sourceforge.net/

+2

You are right about these points:
... This means that str[3] does not necessarily refer to the 4th character ... use them only as dummy byte arrays ...

A C++ string can only process ASCII characters. This is different from a Java string, which can handle Unicode characters. You can store the encoded bytes of Chinese characters in a string (char in C/C++ is just a byte), but this is pointless, since string simply treats the bytes as ASCII characters, so you cannot use the string functions to process them. wstring may be what you need.

One thing to clarify: UTF-8 is just one way of encoding Unicode characters (converting characters to and from bytes).

-1

Source: https://habr.com/ru/post/953379/