Characters Supported in C++

There seems to be a problem when I write words in foreign characters (French ...)

For example, if I read input into a std::string or char[] as follows:

std::string s;
std::cin>>s;  //if we input the string "café"
std::cout<<s<<std::endl;  //outputs "café"

Everything is fine.

However, if the string is hardcoded:

std::string s="café";
std::cout<<s<<std::endl; //outputs "cafÚ"

What's happening? What characters are supported by C++, and how can I make this work correctly? Is this related to my operating system (Windows 10)? My IDE (VS 15)? Or to C++ itself?

+4
4 answers

In short, if you want or need proper Unicode support on Windows 10 (and on Windows in general), use std::wstring rather than std::string. Windows has no real native support for UTF-8 in its APIs.

In the Win32 API, and in the C/C++ runtime that ships with Visual Studio, Unicode means UTF-16; there is no parallel set of APIs that accept UTF-8. So even if you prefer to work with UTF-8 inside your program, you have to convert between UTF-8 and UTF-16 every time you cross the boundary into the Win32 API or the C/C++ runtime, in both directions.

That said, keeping your own strings in UTF-8 internally and converting only at that boundary is a perfectly workable approach; the important thing is to be consistent about which encoding each string holds, so that Unicode data is never silently reinterpreted.

For that you need UTF-8/UTF-16 conversion routines. They can be written in portable C++ with no platform dependencies, for example:

///////////////////////////////////////////////////////////////////////////////////////////////////
std::wstring UTF8ToUTF16(const std::string& stringUTF8)
{
    // Convert the encoding of the supplied string
    std::wstring stringUTF16;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF8.size();
    stringUTF16.reserve(sourceStringSize);
    while (sourceStringPos < sourceStringSize)
    {
        // Determine the number of code units required for the next character
        static const unsigned int codeUnitCountLookup[] = { 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 4 };
        unsigned int codeUnitCount = codeUnitCountLookup[(unsigned char)stringUTF8[sourceStringPos] >> 4];

        // Ensure that the requested number of code units are left in the source string
        if ((sourceStringPos + codeUnitCount) > sourceStringSize)
        {
            break;
        }

        // Convert the encoding of this character
        switch (codeUnitCount)
        {
        case 1:
        {
            stringUTF16.push_back((wchar_t)stringUTF8[sourceStringPos]);
            break;
        }
        case 2:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x1F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 3:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x0F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F);
            stringUTF16.push_back((wchar_t)unicodeCodePoint);
            break;
        }
        case 4:
        {
            unsigned int unicodeCodePoint = (((unsigned int)stringUTF8[sourceStringPos] & 0x07) << 18) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 1] & 0x3F) << 12) |
                                            (((unsigned int)stringUTF8[sourceStringPos + 2] & 0x3F) << 6) |
                                            ((unsigned int)stringUTF8[sourceStringPos + 3] & 0x3F);
            wchar_t convertedCodeUnit1 = 0xD800 | (((unicodeCodePoint - 0x10000) >> 10) & 0x03FF);
            wchar_t convertedCodeUnit2 = 0xDC00 | ((unicodeCodePoint - 0x10000) & 0x03FF);
            stringUTF16.push_back(convertedCodeUnit1);
            stringUTF16.push_back(convertedCodeUnit2);
            break;
        }
        }

        // Advance past the converted code units
        sourceStringPos += codeUnitCount;
    }

    // Return the converted string to the caller
    return stringUTF16;
}

///////////////////////////////////////////////////////////////////////////////////////////////////
std::string UTF16ToUTF8(const std::wstring& stringUTF16)
{
    // Convert the encoding of the supplied string
    std::string stringUTF8;
    size_t sourceStringPos = 0;
    size_t sourceStringSize = stringUTF16.size();
    stringUTF8.reserve(sourceStringSize * 2);
    while (sourceStringPos < sourceStringSize)
    {
        // Check if a surrogate pair is used for this character
        bool usesSurrogatePair = (((unsigned int)stringUTF16[sourceStringPos] & 0xF800) == 0xD800);

        // Ensure that the requested number of code units are left in the source string
        if (usesSurrogatePair && ((sourceStringPos + 2) > sourceStringSize))
        {
            break;
        }

        // Decode the character from UTF-16 encoding
        unsigned int unicodeCodePoint;
        if (usesSurrogatePair)
        {
            unicodeCodePoint = 0x10000 + ((((unsigned int)stringUTF16[sourceStringPos] & 0x03FF) << 10) | ((unsigned int)stringUTF16[sourceStringPos + 1] & 0x03FF));
        }
        else
        {
            unicodeCodePoint = (unsigned int)stringUTF16[sourceStringPos];
        }

        // Encode the character into UTF-8 encoding
        if (unicodeCodePoint <= 0x7F)
        {
            stringUTF8.push_back((char)unicodeCodePoint);
        }
        else if (unicodeCodePoint <= 0x07FF)
        {
            char convertedCodeUnit1 = (char)(0xC0 | (unicodeCodePoint >> 6));
            char convertedCodeUnit2 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
        }
        else if (unicodeCodePoint <= 0xFFFF)
        {
            char convertedCodeUnit1 = (char)(0xE0 | (unicodeCodePoint >> 12));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
        }
        else
        {
            char convertedCodeUnit1 = (char)(0xF0 | (unicodeCodePoint >> 18));
            char convertedCodeUnit2 = (char)(0x80 | ((unicodeCodePoint >> 12) & 0x3F));
            char convertedCodeUnit3 = (char)(0x80 | ((unicodeCodePoint >> 6) & 0x3F));
            char convertedCodeUnit4 = (char)(0x80 | (unicodeCodePoint & 0x3F));
            stringUTF8.push_back(convertedCodeUnit1);
            stringUTF8.push_back(convertedCodeUnit2);
            stringUTF8.push_back(convertedCodeUnit3);
            stringUTF8.push_back(convertedCodeUnit4);
        }

        // Advance past the converted code units
        sourceStringPos += (usesSurrogatePair) ? 2 : 1;
    }

    // Return the converted string to the caller
    return stringUTF8;
}

Since Windows uses UTF-16 for Unicode, plain ANSI/ASCII text held in std::string or char[] is interpreted in the local code page and is not Unicode at all. So keep your text in UTF-8 and use functions like the ones above to convert from UTF-8 to UTF-16 whenever you pass strings to the Win32 API, and back again for anything it returns.
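For example, a narrow UTF-8 string can be converted right at the call site of a wide Win32 function. This is a minimal sketch, assuming the UTF8ToUTF16 function above is visible in the same translation unit; MessageBoxW is just an arbitrary wide API used for illustration:

#include <windows.h>
#include <string>

std::wstring UTF8ToUTF16(const std::string& stringUTF8); // defined above

void ShowGreeting()
{
    // Keep the text in UTF-8 internally; the bytes are spelled out explicitly
    // so the source file encoding cannot change them ("café" in UTF-8).
    std::string greetingUTF8 = "caf\xC3\xA9";

    // Convert to UTF-16 only at the boundary, when calling the wide Win32 API.
    std::wstring greetingUTF16 = UTF8ToUTF16(greetingUTF8);
    MessageBoxW(nullptr, greetingUTF16.c_str(), L"Greeting", MB_OK);
}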

Also be aware that console output on Windows is a problem of its own: the console does not use UTF-8 by default, and neither the console nor the Windows C/C++ runtime will print UTF-8 narrow strings correctly out of the box.
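One common workaround is to switch the standard output stream into UTF-16 mode and use the wide streams instead. A sketch of that approach follows; note that _setmode and _O_U16TEXT are Microsoft-specific, and after this call ordinary std::cout/printf output on the same stream will no longer work:

#include <fcntl.h>
#include <io.h>
#include <cstdio>
#include <iostream>

int main()
{
    // Switch stdout to UTF-16 mode so the console receives wide characters directly.
    _setmode(_fileno(stdout), _O_U16TEXT);

    // Wide-character output; the literal is stored as UTF-16 on Windows.
    std::wcout << L"caf\u00E9" << std::endl;
    return 0;
}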

Update 2: a bit more background on why things behave this way, for anyone who is wondering.

Why does your program appear to work at all with std::string and std::cin/std::cout on Windows? Because of MBCS (multi-byte character set) support. Narrow strings are interpreted in the current ANSI/OEM code page, which can represent the characters of your local language but only a small slice of Unicode, and the same byte value means a different character under a different code page, which is exactly the kind of mismatch you are seeing. MBCS is a legacy mechanism, not Unicode. Unicode was designed as one character set that covers every language, whereas a code page only ever gives you "your" characters. UTF-8 in particular is not treated as a native encoding anywhere in Windows. If a project is built for the MBCS character set rather than Unicode, narrow strings will mostly appear to work for the local language and then break for everything else. The only fully supported route to Unicode on Windows is std::wstring holding UTF-16, passed to the wide Win32 API.

As for string literals in source code: if you keep your source files pure ASCII there is no ambiguity at all. Visual Studio lets you choose the encoding a file is saved in (for example via "File -> Save As" with "Save with Encoding"). Even if you save the file as UTF-8, though, the compiler may still interpret narrow literals according to the MBCS code page rather than UTF-8, so the safest option is to keep literals ASCII-only and spell any other characters with \x escape sequences. C++11 also added the u8 literal prefix, which guarantees that the literal is encoded as UTF-8. In short, be explicit about which encoding every string in your program is supposed to hold, and convert deliberately rather than hoping the defaults line up.
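To make the difference between those literal forms concrete, here is a small illustration (a sketch; exactly how the first literal is encoded still depends on your source file encoding and compiler settings):

#include <string>

// Plain narrow literal: the stored bytes depend on the source file encoding
// and on the compiler's execution character set, so this one is ambiguous.
std::string plain = "café";

// ASCII-only source with explicit escapes: the bytes are exactly what you
// write (here 0xC3 0xA9, the UTF-8 encoding of é), regardless of file encoding.
std::string escaped = "caf\xC3\xA9";

// C++11 u8 literal: the compiler is required to encode it as UTF-8.
// (Note: in C++20 its type changes to const char8_t[], so this line would
// then need std::u8string or a cast.)
std::string utf8 = u8"caf\u00E9";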

+3

This is a Windows problem: several different encodings are involved. Internally the console works in UTF-16, your source file (and therefore the narrow string literal) is interpreted in your ANSI code page (Windows-1252), and the console reads and writes narrow text in the OEM code page (850 in Western Europe). Input typed at the console arrives already encoded in code page 850 and is echoed back in the same code page, which is why the cin/cout round trip looks fine. In Windows-1252, é is '\xe9'; code page 850 displays that byte as Ú, which is exactly what you get from the hardcoded literal. A u8"é" literal would instead become "\xc3\xa9", which code page 850 shows as ├®.

As an ugly workaround, you can spell the non-ASCII characters with hex escapes in the console's code page (in code page 850, é is 0x82):

std::string s="caf\x82";

A cleaner fix is to keep the text as UTF-16 (a wide or u16 literal) and convert it for output with WideCharToMultiByte.
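A minimal sketch of that conversion, targeting whatever code page the console is currently using (GetConsoleOutputCP and WideCharToMultiByte are existing Win32 calls; error handling is omitted for brevity):

#include <windows.h>
#include <string>
#include <iostream>

std::string ToConsoleEncoding(const std::wstring& wide)
{
    // Ask how many bytes the string needs in the console's current code page
    // (the count includes the terminating null because cchWideChar is -1).
    UINT codePage = GetConsoleOutputCP();
    int size = WideCharToMultiByte(codePage, 0, wide.c_str(), -1, nullptr, 0, nullptr, nullptr);

    // Convert into a buffer of that size, then drop the embedded null terminator.
    std::string narrow(size, '\0');
    WideCharToMultiByte(codePage, 0, wide.c_str(), -1, &narrow[0], size, nullptr, nullptr);
    narrow.resize(size - 1);
    return narrow;
}

int main()
{
    std::cout << ToConsoleEncoding(L"caf\u00E9") << std::endl;
    return 0;
}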

+2

What characters are supported by C++?

C++ itself does not pin this down: what you end up seeing depends on the encodings used at every stage between your editor and your terminal.

There seems to be a problem when I write words in foreign characters ...

... Is this related ... to C++?

Not really.

... my IDE?

Only in the sense that the IDE decides which encoding your source file is saved in.

... my operating system?

Only in the sense that the operating system and its console decide how the bytes you write out are displayed.

Here is, roughly, what happens to a string on its way from the source file to the screen:

  • Your text editor saves the source file using some encoding.
  • The compiler reads the file and decodes the string literal according to the source encoding it assumes.
    • If the editor's encoding and the compiler's assumption differ, the literal's contents are already wrong before the program ever runs (this may be part of your problem).
    • If they agree, the compiler re-encodes the literal into its execution character set, i.e. into some concrete sequence of bytes inside the program.
  • At run time, the program writes those bytes to the terminal.
    • The terminal decodes them using its own code page; on Windows the console code page usually differs from the one the compiler assumed, which is the other likely part of your problem.
  • Keep in mind how little a single char holds: only CHAR_BIT bits (typically 8). Characters outside the basic set therefore take several chars each in encodings such as UTF-8, so indexing into a string or taking its length no longer corresponds to whole characters. Input goes through the same byte-oriented channel, so the terminal's encoding matters when reading as well as when writing.

For the text to come out correctly, all of these encodings have to agree. For example:

The source file is encoded in UTF-8. The compiler expects UTF-8. The terminal is expecting UTF-8. In this case, you see what you get.
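If you want to check which bytes the compiler actually stored for a literal, you can dump them in hex (a small self-contained sketch; the output depends on your source encoding and compiler settings):

#include <cstdio>
#include <string>

int main()
{
    std::string s = "café";

    // Print each byte of the literal in hex: a UTF-8 build ends in "c3 a9",
    // Windows-1252 ends in "e9", and code page 850 would end in "82".
    for (unsigned char c : s)
    {
        std::printf("%02x ", static_cast<unsigned>(c));
    }
    std::printf("\n");
    return 0;
}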

+1

The trick here is setlocale:

#include <clocale>
#include <string>
#include <iostream>

int main() {
    std::setlocale(LC_ALL, "");
    std::string const s("café");
    std::cout << s << '\n';
}

The result for me with the Windows 10 command line is correct, even without changing the terminal code page.

0
