Translation of a string sequence into bytes using fixed encoding, preferably UTF-8

Question

In a Windows C++ console application, I would like to read a password from the command line. The password is used for encryption (and later decryption, possibly elsewhere in the world on a Windows PC with a different locale). I am therefore worried that different locales and encodings will give this passphrase different numerical representations. On the same computer, or on a computer with the same locale, this is obviously not a problem.

Therefore, I would like to fix the encoding (and normalize?) and store the password as UTF-8, which is what is recommended here: http://www.jasypt.org/howtoencryptuserpasswords.html (paragraph 4).

There are many issues around encodings, Unicode, UTF-8 and code pages that I do not fully (or at all) understand. I have experimented with boost::locale and boost::nowide, but could not work out whether or not they work under Windows. Some links with more detail about the problems (on Windows):

http://alfps.wordpress.com/2011/11/22/unicode-part-1-windows-console-io-approaches/

http://alfps.wordpress.com/2011/12/08/unicode-part-2-utf-8-stream-mode/

But these links address the opposite problem: how to make things look the same regardless of the underlying representation. I need the same underlying (bitwise) representation, regardless of how it looks!

So the question is: how can I make sure (and should I?) that the locale/encoding does not affect the underlying data being encrypted, data in the sense of an array of 8-bit integers? I don't necessarily care about UTF-8 or Unicode; I just need to be able to recover the data regardless of locale/encoding. The first link helps to explain the problem.

Thoughts: C does not know about Unicode, so should I resort to some C code for help, or does C++ change that again? Or would restricting the input to "ASCII" characters (which, I know, does not exist as such on Windows) ALWAYS work, as in "on any Windows computer"?
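On the ASCII idea: as far as I know, bytes 0x00 to 0x7F are encoded identically in the Windows ANSI and OEM code pages and in UTF-8, so a passphrase limited to that range should yield the same byte array on any such machine. A minimal sketch of the check, where `is_ascii` is a hypothetical helper name and not part of any library:

```cpp
#include <string>

// Returns true when every byte lies in the 7-bit ASCII range 0x00-0x7F.
// Hypothetical helper: rejects any passphrase whose bytes could depend
// on the active code page.
bool is_ascii(const std::string& s) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        if (static_cast<unsigned char>(s[i]) > 0x7F) {
            return false;
        }
    }
    return true;
}
```

Rejecting non-ASCII input is the blunt option: it sidesteps the encoding question but also forbids passphrases such as pässwörd.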

Solution:

    void EncryptFileNames ( const boost::filesystem::path& p, const std::string& pw );

    int main ( int argc, char **argv ) // No checking
    {
        // Call with encrypt.exe c:\tmp pässwörd
        boost::nowide::args a ( argc, argv ); // Fix arguments - make them UTF-8
        boost::filesystem::path p ( argv [ 1 ] );
        EncryptFileNames ( p, boost::locale::normalize ( argv [ 2 ], boost::locale::norm_nfc, std::locale ( ) ) );
        return 0;
    }

Thanks to all the contributors.

PS: For encryption I use Crypto++ with VS2008 SP1 and Boost (without the ICU backend).

+4

c++ windows unicode utf-8 codepages

degski Sep 08 '12 at 12:17


2 answers

If your application is compiled with _UNICODE, just call WideCharToMultiByte with the UTF-8 code page (CP_UTF8) to get UTF-8. If your application is not compiled with _UNICODE, call MultiByteToWideChar to get UTF-16 from your ACP bytes, and then call WideCharToMultiByte to get UTF-8.

Since the added code shows std::string, the data appears to be in the system's ACP (ANSI code page), so the recipe applies here. By now there are many convenience APIs for this, such as mbstowcs. Don't be distracted by the "MB": it is just Windows-speak for "not UTF-16".
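The two-step recipe can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: the function names `acp_to_utf16`, `utf16_to_utf8` and `acp_to_utf8` are hypothetical, and it assumes the input bytes really are in the ANSI code page of the machine on which the conversion runs. Each Win32 call is made twice, first to query the required length and then to fill the buffer.

```cpp
// Windows-only sketch; on other platforms the guard leaves this block empty.
#ifdef _WIN32
#include <windows.h>
#include <stdexcept>
#include <string>

// Step 1: bytes in the system ANSI code page (CP_ACP) -> UTF-16.
std::wstring acp_to_utf16(const std::string& in) {
    if (in.empty()) return std::wstring();
    const int inLen = static_cast<int>(in.size());
    // First call: ask how many UTF-16 code units are needed.
    const int len = MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                                        in.data(), inLen, NULL, 0);
    if (len <= 0) throw std::runtime_error("MultiByteToWideChar failed");
    std::wstring out(static_cast<std::wstring::size_type>(len), L'\0');
    // Second call: perform the conversion into the sized buffer.
    MultiByteToWideChar(CP_ACP, MB_ERR_INVALID_CHARS,
                        in.data(), inLen, &out[0], len);
    return out;
}

// Step 2: UTF-16 -> UTF-8 (CP_UTF8).
std::string utf16_to_utf8(const std::wstring& in) {
    if (in.empty()) return std::string();
    const int inLen = static_cast<int>(in.size());
    // The last two arguments must be NULL when the target is CP_UTF8.
    const int len = WideCharToMultiByte(CP_UTF8, 0, in.data(), inLen,
                                        NULL, 0, NULL, NULL);
    if (len <= 0) throw std::runtime_error("WideCharToMultiByte failed");
    std::string out(static_cast<std::string::size_type>(len), '\0');
    WideCharToMultiByte(CP_UTF8, 0, in.data(), inLen,
                        &out[0], len, NULL, NULL);
    return out;
}

// Both steps combined: ACP bytes -> UTF-8 bytes.
std::string acp_to_utf8(const std::string& in) {
    return utf16_to_utf8(acp_to_utf16(in));
}
#endif
```

The conversion has to happen on the machine where the password was typed, because that is the only place where the ACP bytes can be interpreted correctly; the resulting UTF-8 bytes no longer depend on the code page.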

+2

bmargulies Sep 08 '12 at 12:52


john · Accepted Answer · 2012-09-08T13:43:06+0000

Firstly, UTF-8 is a red herring. To be international you must use an international character set; there is only one worth considering, and it is called Unicode. How you represent Unicode in your program (that is, how you encode it) is up to you: as long as the encoding can represent all of Unicode, there is no problem. You could choose UTF-8, but since you are on Windows it seems reasonable to choose the encoding that Windows uses internally, which is UTF-16. As bmargulies says, you can use MultiByteToWideChar to get from the local representation (i.e. the local code page) to UTF-16. I do not see the need to take the extra step of generating UTF-8 from UTF-16, but if you want to do that, you can use WideCharToMultiByte.
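If the passphrase is kept as UTF-16, it still has to become an array of bytes before it reaches a cipher or key-derivation function, and the byte order should then be fixed so that every machine produces the same array. A minimal sketch, assuming a C++11 compiler (std::u16string is not available in VS2008) and a hypothetical helper name `utf16le_bytes`:

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Serialize UTF-16 code units to bytes in a fixed little-endian order,
// low byte first, so the result does not depend on the host CPU.
std::vector<std::uint8_t> utf16le_bytes(const std::u16string& s) {
    std::vector<std::uint8_t> out;
    out.reserve(s.size() * 2);
    for (char16_t unit : s) {
        out.push_back(static_cast<std::uint8_t>(unit & 0xFF)); // low byte
        out.push_back(static_cast<std::uint8_t>(unit >> 8));   // high byte
    }
    return out;
}
```

For example, `utf16le_bytes(u"p\u00E4")` yields the four bytes 70 00 E4 00. Whichever encoding is chosen, encryption and decryption must agree on it.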
