Convert UTF-8 to ANSI in C++

I cannot find the answer to this question anywhere.

How do I convert a string from UTF-8 to ANSI (extended ASCII) in C++?

+2
3 answers

Typically, libiconv is used for this; it is portable and works on most platforms. As KerrekSB mentioned, you will run into serious problems if you think of the character set as "extended ASCII". There are at least a hundred character sets that could be called "extended ASCII", including UTF-8.

Also, make sure you know which encoding you actually want: ISO-8859-1 or CP1252 (Windows-1252). The Windows version replaces the C1 control codes (bytes 0x80 to 0x9F) with additional printable characters.
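As a rough illustration of the iconv approach, here is a minimal sketch. The helper name utf8_to is made up for this example, and the code assumes a POSIX-style iconv (as in glibc) whose input parameter is char**; some older libiconv builds declare it as const char** instead. It also assumes the target is a single-byte encoding such as ISO-8859-1 or CP1252, so the output never needs more bytes than the input.

```cpp
#include <iconv.h>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts `in` from UTF-8 to the single-byte encoding
// named by `to` (for example "ISO-8859-1" or "CP1252").
// Throws std::runtime_error if the conversion is unsupported, the input is
// not valid UTF-8, or a character has no equivalent in the target encoding.
std::string utf8_to(const std::string& in, const char* to)
{
    if (in.empty())
        return std::string();

    iconv_t cd = iconv_open(to, "UTF-8");
    if (cd == (iconv_t)-1)
        throw std::runtime_error("iconv_open failed: unsupported conversion");

    // Assumption: single-byte target, so in.size() output bytes are enough.
    std::string out(in.size(), '\0');
    char* src = const_cast<char*>(in.data());
    char* dst = &out[0];
    size_t src_left = in.size();
    size_t dst_left = out.size();

    size_t rc = iconv(cd, &src, &src_left, &dst, &dst_left);
    iconv_close(cd);
    if (rc == (size_t)-1)
        throw std::runtime_error("iconv failed: invalid or unrepresentable input");

    // Trim the part of the buffer that was not written.
    out.resize(out.size() - dst_left);
    return out;
}
```

The choice of target matters: the euro sign (U+20AC) is byte 0x80 in CP1252 but has no equivalent in ISO-8859-1, so utf8_to("\xE2\x82\xAC", "CP1252") should succeed while the same call with "ISO-8859-1" is expected to fail.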

+6

Assuming that by "ANSI" you really mean one of the variants of ISO 8859, we should start with a couple of points.

Firstly, not every string can be converted from UTF-8 (or Unicode in general, regardless of the encoding form used) to ISO 8859. Unicode has a unique code point for almost every character in every language on earth.

ISO 8859 supports far fewer languages and has a separate character set (part) for each group of languages that it supports; the same byte values represent different characters in different parts.

This means that a UTF-8 input string can easily contain characters that cannot be represented in any part of ISO 8859 at all, and can just as easily contain characters that would each require a different part of ISO 8859.

Secondly, even in the best case, the conversion can be decidedly nontrivial. If at all possible, you will almost certainly want to use a library (such as libiconv) for this task. For example, Unicode has a feature called "combining diacritical marks" that allows you to encode something like "A with an acute accent" either as one code point or as two separate code points (one for the "A" and the other for the accent). To encode this in ISO 8859, you will need to convert all the forms into one form (usually the precomposed form).
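To make the two forms concrete, here is a small sketch of how "A with an acute accent" looks at the byte level in UTF-8. The variable and function names are just for illustration.

```cpp
#include <string>

// Precomposed form: U+00C1 LATIN CAPITAL LETTER A WITH ACUTE (one code point).
const std::string precomposed = "\xC3\x81";

// Decomposed form: U+0041 LATIN CAPITAL LETTER A followed by
// U+0301 COMBINING ACUTE ACCENT (two code points).
const std::string decomposed = "A\xCC\x81";

// A plain byte-wise comparison treats the two spellings as different strings,
// even though they render as the same character.
bool same_bytes()
{
    return precomposed == decomposed;
}
```

In ISO-8859-1 both spellings have to end up as the single byte 0xC1, which is why the decomposed form must be composed before conversion.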

Before doing any significant work with Unicode, you usually also want to convert the UTF-8 to UCS-4 (UTF-32), so that every code point occupies one fixed-size unit.

So the sequence will be something like this:

  • Convert UTF-8 to UCS-4.
  • Convert combining diacritical marks into precomposed letters with diacritical marks (normalization, possibly NFKC).
  • Make sure all characters can be encoded in the target character set.
  • Convert to the target character set.

Depending on how you prefer to do things, you can combine steps 3 and 4 into one: convert the characters as you go and, for example, throw an exception if you encounter a character that cannot be represented in the target character set.
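The sequence above could be sketched roughly as follows for ISO-8859-1 as the target. This is only an illustration under several assumptions: the function names and the composition table are invented for the example, the decoder does not reject overlong encodings or surrogates, and the four-entry table stands in for real normalization, which in practice should come from a library.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Step 1: decode UTF-8 into UCS-4 code points (simplified validation).
std::u32string decode_utf8(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");
        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}

// Step 2: merge a base letter and a following combining mark.
// Hypothetical toy table covering only a few acute-accent pairs.
struct Composition { char32_t base, mark, result; };
const Composition kCompositions[] = {
    {U'A', 0x0301, 0x00C1}, {U'E', 0x0301, 0x00C9},
    {U'a', 0x0301, 0x00E1}, {U'e', 0x0301, 0x00E9},
};

std::u32string compose(const std::u32string& in)
{
    std::u32string out;
    for (char32_t cp : in) {
        bool merged = false;
        if (!out.empty()) {
            for (const Composition& c : kCompositions) {
                if (out.back() == c.base && cp == c.mark) {
                    out.back() = c.result;
                    merged = true;
                    break;
                }
            }
        }
        if (!merged)
            out.push_back(cp);
    }
    return out;
}

// Steps 3 and 4 combined: ISO-8859-1 covers exactly U+0000..U+00FF, and each
// code point maps to the byte with the same value. Anything else throws.
std::string utf8_to_latin1(const std::string& utf8)
{
    std::string out;
    for (char32_t cp : compose(decode_utf8(utf8))) {
        if (cp > 0xFF)
            throw std::range_error("character not representable in ISO-8859-1");
        out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
    }
    return out;
}
```

With this sketch, both "A\xCC\x81" (decomposed) and "\xC3\x81" (precomposed) come out as the single byte 0xC1, while input such as the euro sign raises std::range_error.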

+2

Windows only:

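    #include <windows.h>
    #include <string>
    #include <vector>

    // Converts UTF-8 to the system's active ANSI code page (CP_ACP).
    // Characters the code page cannot represent are replaced by its
    // default character.
    std::string UTF8ToANSI(const std::string& s)
    {
        if (s.empty())
            return std::string();

        // UTF-8 -> UTF-16: the first call only returns the required length.
        int nWide = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), NULL, 0);
        if (nWide <= 0)
            return std::string();
        std::vector<wchar_t> wide(nWide);
        MultiByteToWideChar(CP_UTF8, 0, s.data(), (int)s.size(), wide.data(), nWide);

        // UTF-16 -> ANSI code page, again sizing the buffer first.
        int nAnsi = WideCharToMultiByte(CP_ACP, 0, wide.data(), nWide, NULL, 0, NULL, NULL);
        if (nAnsi <= 0)
            return std::string();
        std::string r(nAnsi, '\0');
        WideCharToMultiByte(CP_ACP, 0, wide.data(), nWide, &r[0], nAnsi, NULL, NULL);
        return r;
    }
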
+1

Source: https://habr.com/ru/post/1490630/

