How to choose the right encoding to decode CArchive encoded content

In .NET I want to decode some raw data written by a C++ application. The C++ application is 32-bit, and the C# application is 64-bit.

The C++ application supports Russian and Spanish characters but does not support Unicode. My C# binary reader cannot read the Russian or Spanish characters; it only works for English ASCII characters.

CArchive does not record any encoding, and I'm not sure how to read it from C#.

I tested this with a couple of simple strings. This is what the C++ CArchive writes:

For "ABC": "03 41 42 43"

For "БелАЗ 7555В": "0B C1 E5 EB C0 C7 20 37 35 35 35 C2"

The following shows how the C++ application writes the binary file.

void CColumnDefArray::SerializeData(CArchive& Archive)
{
    int iIndex;
    int iSize;
    int iTemp;
    CString sTemp;

    if (Archive.IsStoring())
    {
        Archive << m_iBaseDataCol;
        Archive << m_iNPValueCol;

        iSize = GetSize();
        Archive << iSize;
        for (iIndex = 0; iIndex < iSize; iIndex++)
        {
            CColumnDef& ColumnDef = ElementAt(iIndex);
            Archive << (int)ColumnDef.GetColumnType();
            Archive << ColumnDef.GetColumnId();
            sTemp = ColumnDef.GetName();
            Archive << sTemp;
        }
    }
}

And this is how I try to read it in C#.

The following can decode "ABC", but not the Russian characters. I tested this.Encoding with all the available options (ASCII, UTF7, etc.). The Russian characters only work with Encoding.Default. But that is apparently not a reliable option, since encoding and decoding usually happen on different PCs.

        public override string ReadString()
        {
            byte blen = ReadByte();
            if (blen < 0xff)
            {
                // *** For russian characters it comes here.***
                return this.Encoding.GetString(ReadBytes(blen));
            }

            var slen = (ushort) ReadInt16();
            if (slen == 0xfffe)
            {
                throw new NotSupportedException(ServerMessages.UnicodeStringsAreNotSupported());
            }

            if (slen < 0xffff)
            {
                return this.Encoding.GetString(ReadBytes(slen));
            }

            var ulen = (uint) ReadInt32();
            if (ulen < 0xffffffff)
            {
                var bytes = new byte[ulen];
                for (uint i = 0; i < ulen; i++)
                {
                    bytes[i] = ReadByte();
                }

                return this.Encoding.GetString(bytes);
            }

            //// Not support for 8-byte lengths 
            throw new NotSupportedException(ServerMessages.EightByteLengthStringsAreNotSupported());
        }

What is the correct approach to decoding this? Is choosing the right code page the way to solve this problem? If so, how do I find out which code page was used for encoding?

I would appreciate it if someone could point me in the right direction.

Edit

I have read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)".

, : - , , ? ++ CArchive?


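The non-Unicode C++ program writes 0B C1 E5 EB C0 C7 20 37 35 35 35 C2 (drop the first byte, which is the length, and call the rest bytes).

These bytes display as "ÁåëÀÇ 7555Â" when bytes is decoded with code page 1252.

On a machine whose default ANSI code page is 1252, the following gives "ÁåëÀÇ 7555Â":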

string result = Encoding.Default.GetString(bytes);

That works only where the default code page is 1252. To get "ÁåëÀÇ 7555Â" on any machine, name the code page explicitly:

//result will be `"ÁåëÀÇ 7555Â"`, always
Encoding cp1252 = Encoding.GetEncoding(1252);
string result = cp1252.GetString(bytes);



The same thing happens with other languages. For example, with Greek:
string greek = "ελληνικά";
Encoding cp1253 = Encoding.GetEncoding(1253);
var bytes = cp1253.GetBytes(greek);

These bytes are equivalent to what the C++ program writes. Decode them with code page 1252:

//result will be "åëëçíéêÜ"
Encoding cp1252 = Encoding.GetEncoding(1252);
string result = cp1252.GetString(bytes);

The result is "åëëçíéêÜ". To get "ελληνικά" back, decode with code page 1253:

//result will be "ελληνικά"
Encoding cp1253 = Encoding.GetEncoding(1253);
string greek_decoded = cp1253.GetString(bytes);
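The same check can be run on the bytes from the question. The sketch below is only an illustration: it assumes the writer ran under Windows-1251 (a guess based on the question mentioning Russian text), and the class and method names are made up for the example.

```csharp
using System;
using System.Text;

static class QuestionBytes
{
    // Payload from the question, without the leading 0x0B length byte.
    static readonly byte[] Payload =
        { 0xC1, 0xE5, 0xEB, 0xC0, 0xC7, 0x20, 0x37, 0x35, 0x35, 0x35, 0xC2 };

    // Hypothetical helper: decode the payload with the given Windows code page.
    public static string Decode(int codePage)
    {
        // On .NET Core / .NET 5+ the Windows code pages must be registered first;
        // on .NET Framework this call can be dropped.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(codePage).GetString(Payload);
    }

    static void Main()
    {
        Console.OutputEncoding = Encoding.UTF8;
        Console.WriteLine(Decode(1251)); // БелАЗ 7555В
        Console.WriteLine(Decode(1252)); // ÁåëÀÇ 7555Â
    }
}
```

Only the 1251 reading is meaningful Russian text, which is a reasonable hint that 1251 was the writer's code page.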

So you need to know which code page the C++ program used (as Hans Passant noted).

For example:

public override string ReadString()
{
    //Default code page if both programs use the same code page
    Encoding encoder = System.Text.Encoding.Default;

    //or find out what code page the C++ program is using
    //Encoding encoder = System.Text.Encoding.GetEncoding(codepage);

    //or use English code page to always get "ÁåëÀÇ 7555Â"...
    //Encoding encoder = System.Text.Encoding.GetEncoding(1252);
    //(not recommended)

    byte blen = ReadByte();
    if (blen < 0xff)
        return encoder.GetString(ReadBytes(blen));

    var slen = (ushort)ReadInt16();
    if (slen == 0xfffe)
        throw new NotSupportedException(
            ServerMessages.UnicodeStringsAreNotSupported());

    if (slen < 0xffff)
        return encoder.GetString(ReadBytes(slen));

    var ulen = (uint)ReadInt32();
    if (ulen < 0xffffffff)
    {
        var bytes = new byte[ulen];
        for (uint i = 0; i < ulen; i++)
            bytes[i] = ReadByte();
        return encoder.GetString(bytes);
    }

    throw new NotSupportedException(
        ServerMessages.EightByteLengthStringsAreNotSupported());
}
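If the code page is known per file rather than per machine, it can be passed in explicitly. The following is a minimal standalone sketch of that idea, not the original reader class: CArchiveStringReader and Read are hypothetical names, the length-prefix handling simply mirrors the ReadString above, and the exception messages replace the ServerMessages helpers.

```csharp
using System;
using System.IO;
using System.Text;

static class CArchiveStringReader
{
    static CArchiveStringReader()
    {
        // Needed on .NET Core / .NET 5+ so that GetEncoding(1251) etc. resolve.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
    }

    // Reads one length-prefixed ANSI string and decodes it with the caller's code page.
    public static string Read(BinaryReader reader, int codePage)
    {
        Encoding encoding = Encoding.GetEncoding(codePage);

        byte blen = reader.ReadByte();
        if (blen < 0xff)
            return encoding.GetString(reader.ReadBytes(blen));

        ushort slen = reader.ReadUInt16();
        if (slen == 0xfffe)
            throw new NotSupportedException("Unicode strings are not supported.");
        if (slen < 0xffff)
            return encoding.GetString(reader.ReadBytes(slen));

        uint ulen = reader.ReadUInt32();
        if (ulen < 0xffffffff)
            return encoding.GetString(reader.ReadBytes(checked((int)ulen)));

        throw new NotSupportedException("8-byte lengths are not supported.");
    }

    static void Main()
    {
        // The second sample record from the question, length byte included.
        byte[] record = { 0x0B, 0xC1, 0xE5, 0xEB, 0xC0, 0xC7, 0x20, 0x37, 0x35, 0x35, 0x35, 0xC2 };
        using (var reader = new BinaryReader(new MemoryStream(record)))
        {
            Console.OutputEncoding = Encoding.UTF8;
            Console.WriteLine(Read(reader, 1251)); // БелАЗ 7555В
        }
    }
}
```

Passing the code page as a parameter keeps the decision outside the reader, so the same code can handle files written on differently configured machines.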

Notes:

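If the MFC program is not built for Unicode, it uses the old ANSI code page scheme. A char holds values up to 255, and those values are mapped to different alphabets (Latin, Greek, Cyrillic, ...) depending on the code page.

Code page 1252 covers Western European languages, 1253 covers Greek, and so on.

The MFC archive does not record which code page was used.

Western European languages (English, Spanish, and so on) all use 1252, so text in those languages reads the same on any such machine. Use System.Text.Encoding.Default when both programs run under the same code page, or System.Text.Encoding.GetEncoding(variable_codepage) otherwise.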

Windows ANSI code pages:

874 – Windows Thai
1250 – Windows Central and East European Latin 2
1251 – Windows Cyrillic
1252 – Windows West European Latin 1
1253 – Windows Greek
1254 – Windows Turkish
1255 – Windows Hebrew
1256 – Windows Arabic
1257 – Windows Baltic
1258 – Windows Vietnamese
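If the Windows locale of the writing machine is known, .NET can look up the matching ANSI code page instead of hard-coding a number from this list. This is a small sketch under that assumption; AnsiCodePages and ForCulture are hypothetical names, and the lookup tells you what a culture normally uses, not what a particular file was actually written with.

```csharp
using System;
using System.Globalization;

static class AnsiCodePages
{
    // Returns the ANSI code page .NET associates with a culture name such as "ru-RU".
    public static int ForCulture(string cultureName)
    {
        return new CultureInfo(cultureName).TextInfo.ANSICodePage;
    }

    static void Main()
    {
        foreach (string name in new[] { "ru-RU", "es-ES", "el-GR" })
            Console.WriteLine(name + " -> " + ForCulture(name));
    }
}
```

For these three cultures the values should line up with the Cyrillic, West European and Greek entries in the list above.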

. Unicode ANSI, .

, , . . Unicode .

. Unicode


Source: https://habr.com/ru/post/1656799/

