How can I avoid mixing up string encodings in a C/C++ API?

I am implementing various APIs in C and C++ and am wondering what techniques are available to keep clients from getting the encoding wrong when they receive strings from the framework or pass strings back to it. For example, imagine a simple C++ plugin API that clients can implement to influence translations. It might have a function like this:

const char *getTranslatedWord( const char *englishWord ); 

Now let's say that I would like to ensure that all strings are passed as UTF-8. Of course, I would document this requirement, but I would also like the compiler to enforce the correct encoding, possibly by using dedicated types. For example, something like this:

 class Word {
 public:
     static Word fromUtf8( const char *data ) { return Word( data ); }
     // const, so it can be called through the const Word & taken by the API
     const char *toUtf8() const { return m_data; }
 private:
     Word( const char *data ) : m_data( data ) { }
     const char *m_data;
 };

Now I can use this specialized type in the API:

 Word getTranslatedWord( const Word &englishWord ); 

Unfortunately, it is easy to make this very inefficient. The Word class has no proper copy constructor, assignment operator, etc., and I would like to avoid unnecessary copying of data as much as possible. I also see the danger that Word will keep growing with all kinds of utility functions (such as length, fromLatin1, or substr), and I would rather not write yet another string class. I just want a small container that prevents accidentally mixing up encodings.
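One way to keep such a wrapper small is to make it a non-owning view: it stores only the pointer, so the compiler-generated copy operations copy one pointer and no string data. The following is a minimal sketch under assumptions, not an established solution: the names EncodedPtr, Utf8Tag, Latin1Tag, Utf8Ptr, Latin1Ptr and exampleUsage are hypothetical, the body of getTranslatedWord is a stub, and the caller remains responsible for keeping the underlying buffer alive.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical empty tag types; each one names an encoding.
struct Utf8Tag {};
struct Latin1Tag {};

// Non-owning wrapper: holds only a pointer and never copies string data.
// The referenced buffer must outlive every wrapper that points to it.
template <typename Encoding>
class EncodedPtr {
public:
    // explicit: a bare const char * never converts silently.
    explicit EncodedPtr(const char *data) : m_data(data) { assert(data != 0); }
    const char *get() const { return m_data; }
private:
    const char *m_data;
};

typedef EncodedPtr<Utf8Tag> Utf8Ptr;
typedef EncodedPtr<Latin1Tag> Latin1Ptr;

// Stub standing in for the plugin function from the question.
Utf8Ptr getTranslatedWord(Utf8Ptr englishWord) {
    if (std::strcmp(englishWord.get(), "hello") == 0)
        return Utf8Ptr("hallo");
    return englishWord;  // unknown word: hand it back unchanged
}

// Example call site: the encoding has to be named at the boundary.
bool exampleUsage() {
    Utf8Ptr word = getTranslatedWord(Utf8Ptr("hello"));
    // getTranslatedWord("hello");             // rejected: no implicit conversion
    // getTranslatedWord(Latin1Ptr("hello"));  // rejected: wrong encoding tag
    return std::strcmp(word.get(), "hallo") == 0;
}
```

Because the wrapper has no virtual functions and a single pointer member, it should be as cheap to pass by value as the raw pointer. The tag only records what the caller claims; it does not verify that the bytes really are UTF-8.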

I wonder whether anyone has experience with this and can share some useful techniques.

EDIT: In my particular case, the API is used on Windows and Linux, built with MSVC 6 through MSVC 10 on Windows and gcc 3 and 4 on Linux.

+4
3 answers

You can pass around a std::pair instead of a char *:

 struct utf8_tag_t{} utf8_tag;
 std::pair<const char*,utf8_tag_t> getTranslatedWord(std::pair<const char*,utf8_tag_t> englishWord);

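The generated machine code should be identical on a decent modern compiler, provided its std::pair implementation applies the empty base class optimization to the tag. This is not guaranteed: an implementation that stores both members directly still reserves space for the empty tag.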
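As an illustration of the tagged-pair idea, the sketch below shows how a call site might look. The typedef Utf8String, the helper utf8(), the function tagIsFree() and the stub body of getTranslatedWord are hypothetical names introduced here. The size check assumes nothing about the standard library; it only reports whether the empty tag costs space in the implementation being used.

```cpp
#include <cassert>
#include <cstring>
#include <utility>

struct utf8_tag_t {};

// Hypothetical shorthand for the tagged pair.
typedef std::pair<const char *, utf8_tag_t> Utf8String;

// Hypothetical helper: the one place where a raw pointer gets tagged as UTF-8.
inline Utf8String utf8(const char *data) {
    assert(data != 0);
    return Utf8String(data, utf8_tag_t());
}

// Stub standing in for the plugin function.
Utf8String getTranslatedWord(Utf8String englishWord) {
    if (std::strcmp(englishWord.first, "hello") == 0)
        return utf8("hallo");
    return englishWord;  // unknown word: hand it back unchanged
}

// True only if the library stores the empty tag without using extra space.
inline bool tagIsFree() {
    return sizeof(Utf8String) == sizeof(const char *);
}
```

A call then reads getTranslatedWord(utf8("hello")).first, while getTranslatedWord("hello") is rejected because a const char * does not convert to the pair.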


I wouldn't worry about that, though. I would just use char *s and document that the input must be UTF-8. If the data may come from an untrusted source, you have to check the encoding at runtime anyway.
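A runtime check of this kind could look roughly like the sketch below. The function name isValidUtf8 is hypothetical, and the code assumes NUL-terminated input. It rejects invalid lead bytes, stray or missing continuation bytes, overlong forms, UTF-16 surrogates and code points above U+10FFFF.

```cpp
#include <cassert>

// Returns true if the NUL-terminated byte string is well-formed UTF-8.
bool isValidUtf8(const char *text) {
    assert(text != 0);
    const unsigned char *p = reinterpret_cast<const unsigned char *>(text);
    while (*p != 0) {
        unsigned long codePoint;
        int trailing;           // number of continuation bytes expected
        unsigned long minimum;  // smallest code point allowed for this length
        if (*p < 0x80)                { codePoint = *p;        trailing = 0; minimum = 0; }
        else if ((*p & 0xE0) == 0xC0) { codePoint = *p & 0x1F; trailing = 1; minimum = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { codePoint = *p & 0x0F; trailing = 2; minimum = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { codePoint = *p & 0x07; trailing = 3; minimum = 0x10000; }
        else return false;  // stray continuation byte or invalid lead byte
        ++p;
        for (int i = 0; i < trailing; ++i, ++p) {
            // A terminating NUL fails this test too, so we never read past the end.
            if ((*p & 0xC0) != 0x80) return false;
            codePoint = (codePoint << 6) | (*p & 0x3F);
        }
        if (codePoint < minimum) return false;                         // overlong form
        if (codePoint >= 0xD800 && codePoint <= 0xDFFF) return false;  // surrogate
        if (codePoint > 0x10FFFF) return false;                        // out of range
    }
    return true;
}
```

With this in place, a plugin host could validate every string that crosses the API boundary from an untrusted client and reject, for example, Latin-1 input such as "gr\xFC\xDF" while accepting its UTF-8 form "gr\xC3\xBC\xC3\x9F".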

+4

I suggest you use std::wstring.

Check out this other question for details.

+1

The ICU project provides a Unicode support library for C++.

0

Source: https://habr.com/ru/post/1310479/

