Exceptions with Unicode what()

Or "how do Russians throw exceptions?"

The definition of std::exception:

    namespace std {
        class exception {
        public:
            exception() throw();
            exception(const exception&) throw();
            exception& operator=(const exception&) throw();
            virtual ~exception() throw();
            virtual const char* what() const throw();
        };
    }

A popular school of thought for designing an exception hierarchy is to derive from std::exception:

"Generally, it is best to throw objects, not built-ins. If possible, you should throw instances of classes that derive (ultimately) from the std::exception class. By making your exception class inherit (ultimately) from the standard exception base class, you make life easier for your users (they have the option of catching most things via std::exception), plus you are probably providing them with more information (e.g. the fact that your particular exception might be a refinement of std::runtime_error or something else)."

But in the face of Unicode, it seems impossible to design an exception hierarchy that achieves both of the following:

  • Derives ultimately from std::exception, for ease of use at the catch site
  • Provides Unicode compatibility, so that diagnostics are not truncated or turned into gibberish

Coming up with an exception class that can be constructed from Unicode strings is simple enough. But the standard dictates that what() must return a const char*, so at some point the input strings must be converted to ASCII. Whether that is done at construction time or when what() is called, if the source string uses characters that cannot be represented in 7-bit ASCII, it may be impossible to format the message without losing fidelity.

How do you create an exception hierarchy that combines seamless integration with std::exception and lossless Unicode handling?

+45
c++ c++11
Sep 21 '10 at 13:27
8 answers

char* does not mean ASCII. You could use an 8-bit Unicode encoding such as UTF-8. char could also be 16 bits or more, in which case you could use UTF-16.
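A minimal sketch of that idea, assuming the project simply agrees that what() carries UTF-8. The names Utf8Error and sample_message are hypothetical, and nothing in the language enforces the convention:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical exception type: by convention only, the message passed in
// and the bytes returned by what() are UTF-8.
class Utf8Error : public std::runtime_error {
public:
    explicit Utf8Error(const std::string& utf8_message)
        : std::runtime_error(utf8_message) {}
};

// The Russian word for "error", spelled out as UTF-8 bytes so that the
// example does not depend on the encoding of the source file.
inline const char* sample_message() {
    return "\xD0\x9E\xD1\x88\xD0\xB8\xD0\xB1\xD0\xBA\xD0\xB0";
}
```

A catch site that takes std::exception const& receives the same bytes back unchanged from what(); interpreting them as UTF-8 is up to whoever displays them.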

+31
Sep 21 '10 at 13:34

Returning UTF-8 is an obvious choice. If the application using your exceptions uses a different multibyte encoding, though, it may have a hard time displaying the string. (It cannot know that it is UTF-8, can it?) On the other hand, for the ISO-8859-* 8-bit encodings (Western European, Cyrillic, etc.), displaying a UTF-8 string will "just" show some gibberish, and you (or your user) may be fine with that if you cannot tell a char* in the locale character set apart from UTF-8.

Personally, I think that only low-level error messages should go into what() strings, and I think they should be in English. (Perhaps combined with an error number or something similar.)

The worst problem I see with what() is that it is not uncommon to include some contextual detail, for example a file name, in the what() message. File names are non-ASCII rather often, so you are left with no choice but to use UTF-8 as the what() encoding.

Note also that your exception class (derived from std::exception) can obviously provide any access methods you like, so it may make sense to add an explicit what_utf8() or what_utf16() or what_iso8859_5().

Edit: Regarding John's comment on how to return UTF-8:

If you have a const char* what() function, this function essentially returns a bunch of bytes. On a Western European Windows platform these bytes would usually be encoded as Win1252, but on a Russian Windows they might just as well be Win1251.

What the returned bytes mean depends on their encoding, and their encoding depends on where they came from (and who is interpreting them). A string literal's encoding is determined at compile time, but at runtime it is still up to the application how to interpret the bytes.

So, for your exception to return UTF-8 strings from what() (or what_utf8()), you must make sure that:

  • The input message to your exception has a well-defined encoding
  • You have a well-defined encoding for the string member that you use to store the message
  • You convert the encoding appropriately when what() is called

Example:

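    // convert_iso8859_1_to_utf8() is assumed to be provided elsewhere.
    struct MyExc : public std::runtime_error {
        // std::runtime_error is used as the base because the standard
        // std::exception has no constructor that takes a message.
        explicit MyExc(const char* msg) : std::runtime_error(msg) { }
        // const, so that it can be called through a const reference
        std::string what_utf8() const {
            return convert_iso8859_1_to_utf8( what() );
        }
    };

    // In a ISO-8859-1 encoded source file
    const char* my_err_msg = "ISO-8859-1 ... Àâüß ...";
    ...
    throw MyExc(my_err_msg);
    ...
    catch(MyExc const& e) {
        std::string iso8859_1_msg = e.what();
        std::string utf_msg = e.what_utf8();
        ...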

The conversion could also be placed in the (overridden) what() member function of MyExc, or you could define the exception to take an already UTF-8 encoded string, or you could convert (from an expected input encoding, perhaps wchar_t/UTF-16) in the ctor.
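As a sketch of the last variant (converting in the ctor), assuming the input arrives as UTF-32 in a std::u32string rather than wchar_t, so that the example behaves the same on every platform. UnicodeError and append_utf8 are hypothetical names:

```cpp
#include <exception>
#include <string>

// Encode one Unicode code point as UTF-8 and append it to `out`.
// Sketch only: surrogates and values above U+10FFFF become U+FFFD.
inline void append_utf8(std::string& out, char32_t cp) {
    if ((cp >= 0xD800 && cp <= 0xDFFF) || cp > 0x10FFFF) cp = 0xFFFD;
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Hypothetical exception that converts once, at construction time,
// and stores the message as UTF-8.
class UnicodeError : public std::exception {
public:
    explicit UnicodeError(const std::u32string& message) {
        for (char32_t cp : message) append_utf8(utf8_, cp);
    }
    const char* what() const noexcept override { return utf8_.c_str(); }
private:
    std::string utf8_;
};
```

Converting in the ctor keeps what() trivial and non-allocating, at the cost of doing the work even when nobody reads the message.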

+6
Sep 21 '10 at 14:15

First question: what do you intend to do with the what() string?

Do you plan to log the information somewhere?

If so, you should not be using the contents of the what() string directly; you should be using that string as a reference to look up the correct locale-specific logging message. So to me, the what() content is not for logging purposes (or any form of display); it is a means of looking up the actual logging string (which can be any Unicode string).

Now, the what() string may well contain a human-readable message for developers to help with quick debugging (but this does not need to be highly readable, polished text). As a result, there is no reason to support anything beyond ASCII. Follow the KISS principle.
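One possible sketch of such a lookup, assuming the what() string is a plain ASCII key and the translations live in an in-memory table. The catalog contents, russian_catalog and localized_message are all hypothetical; a real application would load the table from resource files:

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical message catalog: ASCII key (as returned by what()) mapped
// to localized UTF-8 text, written as byte escapes.
inline const std::map<std::string, std::string>& russian_catalog() {
    static const std::map<std::string, std::string> catalog = {
        {"file_not_found",
         "\xD0\xA4\xD0\xB0\xD0\xB9\xD0\xBB" " "
         "\xD0\xBD\xD0\xB5" " "
         "\xD0\xBD\xD0\xB0\xD0\xB9\xD0\xB4\xD0\xB5\xD0\xBD"},
    };
    return catalog;
}

// Look up the display text for an exception. When the key has no
// translation, fall back to the raw key so nothing is lost.
inline std::string localized_message(const std::exception& e) {
    const std::map<std::string, std::string>& catalog = russian_catalog();
    const auto it = catalog.find(e.what());
    return it != catalog.end() ? it->second : std::string(e.what());
}
```

With this arrangement the exception itself stays ASCII-only, and all encoding decisions are made where the message is displayed or logged.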

+4
Sep 21 '10 at 15:24

A const char* does not have to point to an ASCII string; it can be in a multibyte encoding such as UTF-8. One option is to use wcstombs() and friends to convert wstrings to strings, but you may have to convert the result of what() back to a wstring before printing. It also involves more copying and memory allocation than you may be comfortable with in an exception handler.

I usually just define my own base exception class, which takes a wstring instead of a string in the constructor and returns a const wstring& from what(). It is not that big a deal. The lack of a standard one is a pretty big oversight.

Another valid opinion is that exception strings should never be presented to the user, so localizing them is unnecessary, and you do not need to worry about any of this.
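A rough sketch of what such a base class might look like (WideException is a hypothetical name, not taken from any library). Note the trade-off: because it does not derive from std::exception, a catch (const std::exception&) handler will not catch it:

```cpp
#include <string>

// Hypothetical wide-string exception base class. It deliberately does
// not derive from std::exception, whose what() must return const char*.
class WideException {
public:
    explicit WideException(const std::wstring& message)
        : message_(message) {}
    virtual ~WideException() {}

    // Returns the message exactly as it was passed in; no conversion.
    virtual const std::wstring& what() const { return message_; }

private:
    std::wstring message_;
};
```

Catch sites then need a separate handler for WideException alongside the one for std::exception.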

+3
Sep 21 '10 at 13:47

The standard does not specify the encoding of the string returned by what(), and there is no de facto standard either. I just encode it as UTF-8 and return it from what() in my projects. Of course, there may be incompatibilities with other libraries.

See also https://stackoverflow.com/questions/1049947/should-utf-16-be-considered-harmful for why UTF-8 is a good choice.

+2
Sep 21 '10 at 13:35

The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), by Joel Spolsky

Edit: Made CW; commenters can edit in why this link is relevant if they wish.

+2
Sep 21 '10 at 13:39

what() is generally not meant for displaying a message to the user. Among other things, the text it returns is not localizable (even if it were Unicode). I would just use what() to convey something of value to you as the developer (for example, the source file and line number of the place where the exception was thrown), and for that kind of text ASCII is usually more than enough.
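For illustration, a small sketch of a developer-oriented what() that records the throw site. The helper make_located_error and the macro THROW_LOCATED are hypothetical names, and the "file:line: text" layout is just one possible convention:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical helper: builds an ASCII developer message of the form
// "file:line: text".
inline std::runtime_error make_located_error(const char* file, int line,
                                             const std::string& text) {
    return std::runtime_error(std::string(file) + ":" +
                              std::to_string(line) + ": " + text);
}

// Hypothetical convenience macro that captures the throw site.
#define THROW_LOCATED(text) \
    throw make_located_error(__FILE__, __LINE__, (text))
```

Since the result is an ordinary std::runtime_error, existing handlers for std::exception pick it up without changes.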

+1
Sep 21 '10 at 14:22

This is the best way to add Unicode to error handling:

    try {
        // some code
    } catch (std::exception & ex) {
        report_problem(ex.what());
    }

and:

    void report_problem(char const * const message) {
        // here we can convert char to wchar_t or do something else:
        // log it, save it to a file or show a message to the user
    }
+1
Nov 29 '10 at


