C++ substring with multibyte characters

I have a std::string that contains some characters that span multiple bytes.

When I take a substring of this string, the output is invalid because, of course, each of these characters is counted as 2 characters. I thought I should use a wstring instead, because it would store each of these characters as one element, not more.

So I decided to copy the string into a wstring, but of course that doesn't help, because each of these characters is still split across 2 elements. It only makes the situation worse.

Is there a good solution for converting a string to a wstring so that each special character is merged into 1 element instead of 2?

thanks

+3
6 answers

There are really only two possible solutions. If you do this a lot, over long strings, you are better off converting your characters to a single-element encoding, using wchar_t (or int32_t, or whatever is most appropriate). This is not a simple copy that converts each individual char to the target type, but a true conversion function that recognizes multibyte characters and converts each of them into a single element.

For occasional use or shorter sequences, you can write your own functions to advance n characters. For UTF-8, I use the following:

    // Byte (an unsigned 8-bit type) and byteCountTable (UTF-8 sequence length,
    // indexed by lead byte) are defined elsewhere.
    #include <cstddef>
    #include <iterator>

    inline size_t size( Byte ch )
    {
        return byteCountTable[ ch ] ;
    }

    template< typename InputIterator >
    InputIterator succ( InputIterator begin, size_t size, std::random_access_iterator_tag )
    {
        return begin + size ;
    }

    template< typename InputIterator >
    InputIterator succ( InputIterator begin, size_t size, std::input_iterator_tag )
    {
        while ( size != 0 ) {
            ++ begin ;
            -- size ;
        }
        return begin ;
    }

    // Advances past one complete character (assumes well-formed UTF-8).
    template< typename InputIterator >
    InputIterator succ( InputIterator begin, InputIterator end )
    {
        if ( begin != end ) {
            begin = succ( begin, size( *begin ),
                          typename std::iterator_traits< InputIterator >::iterator_category() ) ;
        }
        return begin ;
    }

    template< typename InputIterator >
    size_t characterCount( InputIterator begin, InputIterator end )
    {
        size_t result = 0 ;
        while ( begin != end ) {
            ++ result ;
            begin = succ( begin, end ) ;
        }
        return result ;
    }
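The snippet above relies on a `byteCountTable` that is not shown. Here is a minimal sketch of how such a table might be filled, assuming it maps every possible byte to the length of the UTF-8 sequence it would start. The names `sequenceLength`, `makeByteCountTable` and `nextCharStart` are hypothetical, and mapping continuation or invalid bytes to 1 is my own choice so that iteration always makes progress; it is not necessarily what the author's table does.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: number of bytes in the UTF-8 sequence introduced by `lead`.
// Continuation bytes (10xxxxxx) and bytes that cannot start a sequence map to 1,
// so a caller always advances. No further validation is attempted.
inline std::size_t sequenceLength(unsigned char lead)
{
    if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
    if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
    if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
    if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
    return 1;                             // stray continuation or invalid byte
}

// Builds a 256-entry lookup table, one entry per possible byte value.
inline std::array<unsigned char, 256> makeByteCountTable()
{
    std::array<unsigned char, 256> table{};
    for (unsigned i = 0; i < 256; ++i) {
        table[i] = static_cast<unsigned char>(
            sequenceLength(static_cast<unsigned char>(i)));
    }
    return table;
}

// Byte offset of the character that follows the one starting at `pos`,
// clamped to the end of the string.
inline std::size_t nextCharStart(const std::string& s, std::size_t pos)
{
    assert(pos < s.size());  // precondition: pos addresses an existing byte
    static const std::array<unsigned char, 256> table = makeByteCountTable();
    const std::size_t next = pos + table[static_cast<unsigned char>(s[pos])];
    return next < s.size() ? next : s.size();
}
```

A table lookup and the chain of comparisons give the same answer; the table simply trades 256 bytes of memory for a branch-free lookup.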
+1

A simplified version, based on the solution given by Marcelo Cantos in "Getting the actual length of a UTF-8 encoded std::string?":

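    #include <string>

    // Returns at most maxLength characters from the start of a UTF-8 string.
    std::string substr(std::string originalString, int maxLength)
    {
        std::string resultString = originalString;

        int len = 0;        // characters seen so far
        int byteCount = 0;  // bytes consumed so far
        const char* aStr = originalString.c_str();
        while (*aStr) {
            // Every byte that is not a continuation byte (10xxxxxx) starts a character.
            if ((*aStr & 0xc0) != 0x80)
                len += 1;

            if (len > maxLength) {
                resultString = resultString.substr(0, byteCount);
                break;
            }
            byteCount++;
            aStr++;
        }
        return resultString;
    }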
+6

A std::string object is not a string of characters, it is a string of bytes. It has no notion at all of what is called an encoding. The same goes for std::wstring, except that it is a string of wchar_t values (16 bits on Windows, typically 32 bits elsewhere).

To perform operations on text that require addressing individual characters (taking a substring, for example), you need to know which encoding your std::string object uses.
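To illustrate the point, here is a small sketch, assuming the string holds well-formed UTF-8. The helper name `countCodePoints` is made up for this example.

```cpp
#include <cstddef>
#include <string>

// Counts UTF-8 code points by counting every byte that is NOT a continuation
// byte (10xxxxxx). Assumes `s` is well-formed UTF-8.
inline std::size_t countCodePoints(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char byte : s) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string `"h\xC3\xA9llo"` ("héllo", where é is the two bytes 0xC3 0xA9), `size()` reports 6 while `countCodePoints` reports 5, and the byte-based `substr(0, 2)` ends in a dangling 0xC3 byte.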

UPDATE: Now that you have explained that your input string is encoded in UTF-8, you still need to decide which encoding to use for your std::wstring output. UTF-16 comes to mind, but it really depends on which API you will pass the std::wstring objects to. Assuming UTF-16 is acceptable, you have various options:

  • On Windows, you can use the MultiByteToWideChar function; no additional dependencies required.
  • The UTF8-CPP library claims to provide an easy solution for handling UTF-* encoded strings. I have never tried it myself, but I keep hearing about it.
  • On Linux systems, the libiconv library is used quite often.
  • If you need to deal with all kinds of crazy encodings and want the full-blown alpha and omega as far as encodings are concerned, look at ICU.
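Whichever option you pick, the UTF-16 side of the conversion itself is small. Here is a sketch of it, assuming you already have a decoded code point. The function name `appendUtf16` is hypothetical, and I use std::u16string rather than std::wstring so that the element width does not depend on the platform.

```cpp
#include <cassert>
#include <string>

// Appends one Unicode code point to `out` as UTF-16 code units.
// Assumes `cp` is a valid scalar value (not a surrogate, at most U+10FFFF).
inline void appendUtf16(std::u16string& out, char32_t cp)
{
    assert(cp <= 0x10FFFF);
    if (cp < 0x10000) {
        // Basic Multilingual Plane: one code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary planes: a high and a low surrogate.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));
    }
}
```

Note that even in UTF-16 a character outside the Basic Multilingual Plane takes two elements, so index-based substring operations still need care.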
+5

Unicode is hard.

  • 1) std::wstring is not a list of code points, it is a list of wchar_t, whose width is implementation-defined (usually 16 bits with VC++ and 32 bits with gcc and clang). Yes, that means it's useless for portable code ...
  • 2) A single character can be encoded as several code points (because of diacritics).
  • 3) In some languages, two different characters together form a “unit” that is not really separable (for example, LL is considered a letter in its own right in Spanish).

So ... it's a little complicated.

Solving 3) can be costly (it requires language- or usage-specific annotations); solving 1) and 2) is absolutely necessary ... and requires Unicode-aware libraries, or coding your own (and probably getting it wrong).

  • 1) is trivially solved: writing a routine that converts UTF-8 to code points is trivial (a code point can be represented with a uint32_t).
  • 2) is more complicated: it requires a list of diacritics, and the substring routine must know never to cut just before a diacritic (they follow the character they qualify).

Otherwise, what you are looking for is probably ICU. I wish you good luck with it.
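A rough sketch of points 1) and 2), assuming well-formed UTF-8 input. The names `decodeUtf8`, `isCombiningMark` and `prefixKeepingMarks` are hypothetical, and the diacritic test only covers the Combining Diacritical Marks block (U+0300 to U+036F); a real list is considerably longer, which is exactly why a library is usually the better choice.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Point 1): decode UTF-8 into code points. Assumes well-formed input; overlong
// forms, surrogates and bad continuation bytes are not detected.
inline std::vector<std::uint32_t> decodeUtf8(const std::string& s)
{
    std::vector<std::uint32_t> out;
    std::size_t i = 0;
    while (i < s.size()) {
        const unsigned char lead = static_cast<unsigned char>(s[i]);
        std::size_t len = 1;
        std::uint32_t cp = lead;
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        // Fold in six payload bits per continuation byte, stopping at end of input.
        for (std::size_t k = 1; k < len && i + k < s.size(); ++k) {
            cp = (cp << 6) | (static_cast<unsigned char>(s[i + k]) & 0x3Fu);
        }
        out.push_back(cp);
        i += len;
    }
    return out;
}

// Point 2), heavily simplified: only U+0300..U+036F is treated as a diacritic.
inline bool isCombiningMark(std::uint32_t cp)
{
    return cp >= 0x0300 && cp <= 0x036F;
}

// First `count` code points, extended so that combining marks directly after
// the cut stay attached to the character they qualify.
inline std::vector<std::uint32_t> prefixKeepingMarks(
    const std::vector<std::uint32_t>& cps, std::size_t count)
{
    std::size_t end = count < cps.size() ? count : cps.size();
    while (end < cps.size() && isCombiningMark(cps[end])) {
        ++end;
    }
    return std::vector<std::uint32_t>(
        cps.begin(), cps.begin() + static_cast<std::ptrdiff_t>(end));
}
```

For example, "e" followed by U+0301 (bytes 0xCC 0x81) decodes to two code points, and asking for a prefix of one code point returns both so the accent is not cut off.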

+1

For simplicity, let's assume your encoding is UTF-8. In that case some characters occupy more than one byte, as in your case. So you have a std::string in which these UTF-8 encoded characters are stored, and you want to use substr() in terms of characters, not bytes. I would write a function that converts a length in characters to a length in bytes. For UTF-8, it would look like this:

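    #include <cstdint>

    typedef std::int32_t int32;  // assumed meaning of the int32 type used here

    // Length in bytes of the UTF-8 sequence whose lead byte is `byte`.
    #define UTF8_CHAR_LEN( byte ) ((( 0xE5000000u >> (( (unsigned char)(byte) >> 3 ) & 0x1e )) & 3 ) + 1 )

    // Number of bytes occupied by the first charCnt characters of utf8Str
    // (assumes well-formed UTF-8).
    int32 GetByteCountForCharCount(const char* utf8Str, int charCnt)
    {
        int32 ByteCount = 0;
        // Stop at the terminating '\0' if the string has fewer than charCnt characters.
        for (int i = 0; i < charCnt && *utf8Str != '\0'; i++) {
            int charlen = UTF8_CHAR_LEN(*utf8Str);
            ByteCount += charlen;
            utf8Str += charlen;
        }
        return ByteCount;
    }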

So, say you want to substr() the string from the 7th character. No problem:

    int32 pos = GetByteCountForCharCount(str.c_str(), 7);
    str.substr(pos);
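The same idea can be wrapped into a single function that takes a start and a length, both in characters. This is only a sketch, assuming well-formed UTF-8; `byteOffsetOfChar` and `utf8Substr` are hypothetical names, and the scan is bounded by the string's size rather than relying on a terminating '\0'.

```cpp
#include <cstddef>
#include <string>

// Byte offset at which the character with zero-based index `charIndex` starts,
// or s.size() if the string has fewer characters. Assumes well-formed UTF-8.
inline std::size_t byteOffsetOfChar(const std::string& s, std::size_t charIndex)
{
    std::size_t seen = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        // A byte that is not a continuation byte (10xxxxxx) starts a character.
        if ((static_cast<unsigned char>(s[i]) & 0xC0) != 0x80) {
            if (seen == charIndex) {
                return i;
            }
            ++seen;
        }
    }
    return s.size();
}

// Like std::string::substr, but `startChar` and `charCount` are in characters.
inline std::string utf8Substr(const std::string& s,
                              std::size_t startChar, std::size_t charCount)
{
    const std::size_t first = byteOffsetOfChar(s, startChar);
    // Measure the requested number of characters within the remaining tail.
    const std::size_t length = byteOffsetOfChar(s.substr(first), charCount);
    return s.substr(first, length);
}
```

Out-of-range requests are clamped: a start past the end yields an empty string, and a count that is too large yields everything up to the end.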
0

Based on this, I wrote my own UTF-8 substring function:

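    #include <cstddef>
    #include <string>

    // Stores the first SubStrLength characters of a UTF-8 string in csSubstring.
    void utf8substr(std::string originalString, int SubStrLength, std::string& csSubstring)
    {
        int len = 0;
        size_t byteIndex = 0;
        const char* aStr = originalString.c_str();
        size_t origSize = originalString.size();

        for (byteIndex = 0; byteIndex < origSize; byteIndex++) {
            // Every byte that is not a continuation byte (10xxxxxx) starts a character.
            if ((aStr[byteIndex] & 0xc0) != 0x80)
                len += 1;

            // Stop at the first byte of the character after the last one wanted.
            if (len > SubStrLength)
                break;
        }

        csSubstring = originalString.substr(0, byteIndex);
    }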
0

Source: https://habr.com/ru/post/989580/

