The "pure" ASCII character set is limited to the range 0-127 (7 bits). 8-bit characters with the most significant bit set (i.e., in the range 128-255) are not uniquely defined: their meaning depends on the code page. So your character ą (LATIN SMALL LETTER A WITH OGONEK) is represented by the value 0xB9 only in a particular code page, which here should be Windows-1250. In other code pages, the value 0xB9 is associated with a different character (for example, in Windows-1252, 0xB9 maps to the symbol ¹, i.e., SUPERSCRIPT ONE).
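To make the code page dependence concrete, here is a minimal, portable sketch that decodes a few bytes under both code pages. The names `CodePage`, `DecodeByte` and `Decode` are hypothetical, invented for this illustration, and the lookup table is deliberately partial: it covers only ASCII plus the four bytes that spell "ąółź" in Windows-1250. Any other byte is returned as U+FFFD.

```cpp
#include <string>

// Hypothetical enum for this illustration: the two code pages discussed above.
enum class CodePage { Windows1250, Windows1252 };

// Map one byte to a Unicode code point under the given code page.
// Partial table only: ASCII plus the bytes 0xB9, 0xF3, 0xB3, 0x9F.
char32_t DecodeByte(unsigned char byte, CodePage codePage)
{
    if (byte < 0x80)
        return byte; // ASCII is identical in both code pages

    const bool is1250 = (codePage == CodePage::Windows1250);
    switch (byte)
    {
    case 0xB9: return is1250 ? 0x0105 : 0x00B9; // ą vs. ¹
    case 0xF3: return 0x00F3;                   // ó in both code pages
    case 0xB3: return is1250 ? 0x0142 : 0x00B3; // ł vs. ³
    case 0x9F: return is1250 ? 0x017A : 0x0178; // ź vs. Ÿ
    default:   return 0xFFFD;                   // not covered by this sketch
    }
}

// Decode a whole byte string, one byte per character.
std::u32string Decode(const std::string& bytes, CodePage codePage)
{
    std::u32string result;
    for (unsigned char b : bytes)
        result.push_back(DecodeByte(b, codePage));
    return result;
}
```

The same byte sequence "\xB9\xF3\xB3\x9F" decodes to "ąółź" under Windows-1250 but to "¹ó³Ÿ" under Windows-1252, which is why the conversion must be told which code page the input uses.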
To convert characters from a specific code page to Unicode UTF-16 using the Win32 API, you can call MultiByteToWideChar, specifying the correct code page. That is not CP_UTF8, as written in the code in your question: CP_UTF8 identifies Unicode UTF-8. Try specifying 1250 (ANSI Central European; Central European (Windows)) as the code page identifier instead.
If you have access to ATL in your code, you can use the ATL string conversion helper classes such as CA2W, which wraps the call to MultiByteToWideChar() and the memory allocation in an RAII class, e.g.:
    #include <atlconv.h> // ATL String Conversion Helpers

    // 'test' is a Unicode UTF-16 string.
    // Conversion is done from code-page 1250
    // (ANSI Central European; Central European (Windows))
    CA2W test("ąółź", 1250);
Now you can use the test string in your Unicode APIs.
If you do not have access to ATL, or want a solution based on the C++ STL, you may want to consider the following code:
    ///////////////////////////////////////////////////////////////////////////////
    //
    // Modern STL-based C++ wrapper to Win32 MultiByteToWideChar() C API.
    //
    // (based on http://code.msdn.microsoft.com/windowsdesktop/C-UTF-8-Conversion-Helpers-22c0a664)
    //
    ///////////////////////////////////////////////////////////////////////////////
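Only the header comment of that listing appears here, so what follows is a minimal sketch of what such a wrapper could look like, not the original code. The function name `ConvertToUtf16` and its signature are assumptions made for this illustration. It uses the usual two-call pattern: call MultiByteToWideChar() once to get the required length, then again to do the conversion. The flags argument is 0, so bytes that are undefined in the code page are not reported as errors.

```cpp
// Windows-only sketch; the guard merely keeps other platforms from
// failing on <windows.h>.
#ifdef _WIN32
#include <windows.h>

#include <climits>
#include <stdexcept>
#include <string>

// Hypothetical helper (name and signature are assumptions): converts a string
// encoded in 'codePage' to UTF-16 and throws std::runtime_error on failure.
std::wstring ConvertToUtf16(const std::string& source, UINT codePage)
{
    if (source.empty())
        return std::wstring();

    // MultiByteToWideChar() takes lengths as int.
    if (source.length() > static_cast<size_t>(INT_MAX))
        throw std::runtime_error("Input string is too long.");
    const int sourceLength = static_cast<int>(source.length());

    // First call: query the required length of the result, in wchar_ts.
    const int destLength = ::MultiByteToWideChar(
        codePage, 0, source.data(), sourceLength, nullptr, 0);
    if (destLength == 0)
        throw std::runtime_error("MultiByteToWideChar failed (length query).");

    // Second call: perform the conversion into the allocated buffer.
    std::wstring dest(static_cast<size_t>(destLength), L'\0');
    const int written = ::MultiByteToWideChar(
        codePage, 0, source.data(), sourceLength, &dest[0], destLength);
    if (written == 0)
        throw std::runtime_error("MultiByteToWideChar failed (conversion).");

    return dest;
}
#endif // _WIN32
```

With a helper like this, a call such as ConvertToUtf16("\xB9", 1250) would be expected to give a one-character UTF-16 string containing ą (U+0105), which can then be passed to Unicode APIs through its c_str().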