C# Unicode to ANSI conversion

Question

I need your help with something that bothers me when working with Unicode encodings in the .NET Framework...

I have to exchange data with client systems whose applications are not Unicode-aware, and these clients are global companies (Chinese, Korean, Russian, ...). They therefore send me 8-bit ANSI text files, encoded with their own Windows code page.

So if a Greek client sends me a text file containing "Σ" (the letter sigma, "\u03A3") in a product name, I get whichever letter sits at ANSI code point 211 in my own code page. My computer runs French Windows, which means the code page is Windows-1252, so I see "Ó" in this text file... OK.

I know that this client is Greek, so I can read their file by forcing the windows-1253 code page in my import options.

    /// <summary>
    /// Convert a string ASCII value using code page encoding to Unicode encoding
    /// </summary>
    /// <param name="value"></param>
    /// <returns></returns>
    public static string ToUnicode(string value, int codePage)
    {
        Encoding windows = Encoding.Default;
        Encoding unicode = Encoding.Unicode;
        Encoding sp = Encoding.GetEncoding(codePage);
        if (sp != null && !String.IsNullOrEmpty(value))
        {
            // First get bytes in windows encoding
            byte[] wbytes = windows.GetBytes(value);

            // Check if CodePage to use is different from current Windows one
            if (windows.CodePage != sp.CodePage)
            {
                // Convert to Unicode using SP code page
                byte[] ubytes = Encoding.Convert(sp, unicode, wbytes);
                return unicode.GetString(ubytes);
            }
            else
            {
                // Directly convert to Unicode using windows code page
                byte[] ubytes = Encoding.Convert(windows, unicode, wbytes);
                return unicode.GetString(ubytes);
            }
        }
        else
        {
            return value;
        }
    }
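For comparison, here is a minimal sketch of the same import step that decodes the raw file bytes directly with the client's code page, instead of first reading them into a string with the machine's default encoding. The names CodePageImport and DecodeBytes are hypothetical. The RegisterProvider call is an assumption about the runtime: it is only needed on .NET Core / .NET 5+, where legacy code pages such as 1253 are not available by default, and it should be dropped on a plain .NET Framework project.

```csharp
using System;
using System.Text;

public static class CodePageImport
{
    // Hypothetical helper: decode raw bytes with an explicitly chosen code page.
    public static string DecodeBytes(byte[] raw, int codePage)
    {
        // Assumption: running on .NET Core / .NET 5+, where legacy code pages
        // must be registered before use. Remove this line on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Interpret the bytes using the client's code page, not Encoding.Default.
        return Encoding.GetEncoding(codePage).GetString(raw);
    }
}
```

With this helper, the single byte 211 (0xD3) decodes to "Σ" under code page 1253 and to "Ó" under code page 1252, which is the difference described above. For a whole file, File.ReadAllText(path, Encoding.GetEncoding(1253)) does the same job in one call.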

Well, in the end I get "Σ" in my application, and I can save it in my SQL Server database. Now my application has to do some complicated calculations, and then I have to return this file to the client through an automatic export...

So my problem is that I need to do a Unicode => ANSI conversion?! But it is not as simple as I thought at the beginning...

I do not want to save the code page used during import, so my first idea was to convert Unicode to windows-1252 and then automatically send the file to the clients. They will read the exported text file with their own code page, which is why this idea appealed to me.

But the problem is that the conversion in this direction behaves strangely... Here are two different examples:

First example (Я)

    char ya = '\u042F';
    string strYa = Char.ConvertFromUtf32(ya);

    System.Text.Encoding unicode = System.Text.Encoding.Unicode;
    System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
    System.Text.Encoding ansi1251 = System.Text.Encoding.GetEncoding(1251);

    string strYa1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strYa)));
    string strYa1251 = ansi1251.GetString(System.Text.Encoding.Convert(unicode, ansi1251, unicode.GetBytes(strYa)));

So strYa1252 contains '?', while strYa1251 contains the valid character 'Я'. It therefore seems that it is not possible to convert text to ANSI unless a valid code page is passed to the Convert() function. So nothing in the Encoding class helps the user find the equivalence between ANSI and Unicode codes?

Second example (Σ)

    // A \u escape needs exactly four hex digits, so sigma is '\u03A3'
    char sigma = '\u03A3';
    string strSigma = Char.ConvertFromUtf32(sigma);

    System.Text.Encoding unicode = System.Text.Encoding.Unicode;
    System.Text.Encoding ansi1252 = System.Text.Encoding.GetEncoding(1252);
    System.Text.Encoding ansi1253 = System.Text.Encoding.GetEncoding(1253);

    string strSigma1252 = ansi1252.GetString(System.Text.Encoding.Convert(unicode, ansi1252, unicode.GetBytes(strSigma)));
    string strSigma1253 = ansi1253.GetString(System.Text.Encoding.Convert(unicode, ansi1253, unicode.GetBytes(strSigma)));

This time I get the correct 'Σ' in strSigma1253, but I also get 'S' in strSigma1252. As indicated at the beginning, I would expect "Ó" if the ANSI code were preserved, or "?" if the character is not found, but not "S". Why? Yes, of course, a linguist could say that "S" is the equivalent of the Greek letter sigma, because they represent the same sound in both alphabets, but they do not have the same ANSI code!

So, how does the Convert() function in the .NET Framework manage this equivalence?

And does anyone have an idea how to write ANSI characters from Unicode into the text files that I have to send to clients?

c# unicode ansi

alex Jun 10 '13 at 11:54

1 answer

bobince · Accepted Answer · 2013-06-10T22:03:18+0000

I should have ... '?' if the character is not found, but not "S". Why?

This is called "best-fit" encoding, and in most cases it is a bad thing. When Windows cannot encode a character in the target code page (because Σ does not exist in code page 1252), it makes a best effort to map the character to something a bit like it. This can mean losing diacritical marks (ë → e), or mapping to a cognate (Σ → S), to a related character (≤ → =), to an unrelated character that looks a bit similar (∞ → 8), or to whatever other madcap replacement seemed like a good idea at the time but turns out to be culturally or mathematically offensive in practice.

You can see the tables for cp1252, including the sigma mapping, here.

Besides being silent mangling of dubious usefulness, it also has some quite bad security implications. You should be able to stop it by setting the EncoderFallback to ReplacementFallback or ExceptionFallback.
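A minimal sketch of what setting the fallback can look like, using the Encoding.GetEncoding overload that takes explicit fallbacks. The names StrictAnsi and Encode are hypothetical, and the RegisterProvider call is again an assumption that only applies to .NET Core / .NET 5+ (remove it on .NET Framework).

```csharp
using System;
using System.Text;

public static class StrictAnsi
{
    // Hypothetical helper: encode text to a code page with best-fit mapping disabled.
    public static byte[] Encode(string text, int codePage, bool throwOnUnmappable)
    {
        // Assumption: needed on .NET Core / .NET 5+ only; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Either fail loudly or substitute "?" for characters the code page lacks.
        EncoderFallback fallback = throwOnUnmappable
            ? EncoderFallback.ExceptionFallback
            : EncoderFallback.ReplacementFallback;

        Encoding target = Encoding.GetEncoding(
            codePage, fallback, DecoderFallback.ReplacementFallback);

        // With ExceptionFallback this throws EncoderFallbackException
        // on the first character that cannot be represented.
        return target.GetBytes(text);
    }
}
```

Under these assumptions, encoding "Σ" to code page 1252 either yields the byte for "?" or throws, rather than silently producing "S". The exception variant is probably the safer choice for an automated export, since the problem surfaces before a damaged file reaches the client.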

Does anyone have an idea how to write ANSI characters from Unicode into the text files that I have to send to clients?

You will have to keep a table of encodings, one per client. Read their input files using that encoding to decode; write their output files using the same encoding to encode.

(For sanity, set new clients up with UTF-8 and document that this is the preferred encoding.)
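One way such a per-client table could be sketched is shown below. Everything here is illustrative: the class name ClientEncodings, the client identifiers and their code pages are made up, and in a real system the table would more likely live in the database next to the client record. The RegisterProvider call is an assumption for .NET Core / .NET 5+ and should be removed on .NET Framework.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class ClientEncodings
{
    // Hypothetical table: client ids and code pages are illustrative only.
    private static readonly Dictionary<string, int> CodePages =
        new Dictionary<string, int>
        {
            { "greek-client", 1253 },
            { "russian-client", 1251 },
            { "new-client", 65001 }, // UTF-8, the preferred choice for new clients
        };

    public static Encoding For(string clientId)
    {
        // Assumption: needed on .NET Core / .NET 5+ only; remove on .NET Framework.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // Exception fallbacks, so unmappable data fails instead of being altered.
        return Encoding.GetEncoding(
            CodePages[clientId],
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
    }

    // Decode an incoming file with the client's own encoding.
    public static string Import(string clientId, byte[] raw)
    {
        return For(clientId).GetString(raw);
    }

    // Encode an outgoing file with the same encoding used for import.
    public static byte[] Export(string clientId, string text)
    {
        return For(clientId).GetBytes(text);
    }
}
```

Because import and export go through the same table entry, text that came from a client can be written back in the code page that client expects, and the export no longer depends on the code page of the machine doing the work.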
