How UTF-8 encodes a character / string

Question

I use a Twitter API library to post status updates to Twitter. Twitter requires the message to be UTF-8 encoded. The library contains a function that URL-encodes a standard string; it works great for special characters like ! @ # $ % ^ & * ( ), but it produces the wrong encoding for accented characters (and other non-ASCII characters).

For example, “é” is converted to "%E9" and not "%C3%A9" (it is pretty much just converted to its hexadecimal value). Is there a built-in function that can take something like "é" and return something like "%C3%A9"?

Edit: I am new to UTF-8, so apologies if what I am asking does not make sense.

Edit: if I have

string foo = "bar é";

I would like to convert it to

 "bar %C3%A9"

thanks

+5

c++ string utf-8 character-encoding twitter

tom Feb 22 '11 at 19:40

2 answers

To understand what needs to be done, you first need a little background. Different encodings use different values for the "same" character. For example, Latin-1 says that “é” is a single byte with the value E9 (hex), UTF-8 says that “é” is the two-byte sequence C3 A9, and UTF-16 says that the same character is the single two-byte value 00E9: one 16-bit value, not two 8-bit values as in UTF-8. (Unicode, which is not an encoding, uses the same code point value, U+00E9, as Latin-1.)

To convert from one encoding to another, you must first take the encoded value, decode it to a value independent of the source encoding (i.e., the Unicode code point), and then re-encode it in the target encoding. If the target encoding does not support all of the source encoding's code points, you need to either transliterate or otherwise handle this condition.

This step of re-encoding requires knowledge of both the source and target encodings.

Your API function does not convert encodings; it appears to simply URL-escape an arbitrary byte string. Apparently, the authors of the function assume that you have already converted the text to UTF-8.
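To illustrate what such byte-wise URL escaping looks like, here is a minimal sketch. The name percent_encode is hypothetical and this is not the library's actual function; it assumes that only the RFC 3986 "unreserved" characters are left as-is, so a space becomes %20. The point is that the function never looks at the encoding: whatever bytes go in are what get escaped.

```cpp
#include <string>

// Hypothetical helper, not taken from any Twitter library.
// Percent-escapes every byte that is not an RFC 3986 "unreserved"
// character. The input is treated as raw bytes; no encoding
// conversion happens here.
std::string percent_encode(const std::string& bytes) {
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '.' || c == '_' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];    // high nibble
            out += hex[c & 0x0F];  // low nibble
        }
    }
    return out;
}
```

Given the Latin-1 bytes "bar \xE9" this yields "bar%20%E9", and given the UTF-8 bytes "bar \xC3\xA9" it yields "bar%20%C3%A9", which is why the conversion to UTF-8 has to happen before the escaping step.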

To convert to UTF-8, you need to know what encoding your system uses and be able to map it to Unicode code points. From there, encoding to UTF-8 is trivial.
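As a rough sketch of that last step, the function below encodes a single Unicode code point as UTF-8. The name append_utf8 is made up for illustration; the bit layout follows the standard UTF-8 scheme of one to four bytes per code point.

```cpp
#include <cstdint>
#include <string>

// Illustrative helper (hypothetical name): appends the UTF-8 encoding
// of one Unicode code point to `out`. Returns false and appends nothing
// for UTF-16 surrogates and for values above U+10FFFF.
bool append_utf8(std::uint32_t cp, std::string& out) {
    if (cp <= 0x7F) {                       // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp <= 0x7FF) {               // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0xFFFF) {              // 3 bytes
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp <= 0x10FFFF) {            // 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        return false;                       // outside the Unicode range
    }
    return true;
}
```

For U+00E9 (“é”) the two-byte branch applies: 0xC0 | (0xE9 >> 6) is C3 and 0x80 | (0xE9 & 0x3F) is A9, which matches the C3 A9 sequence mentioned above.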

Depending on your system, this can be as simple as converting the “native” character set (which has é as E9 for you, so possibly Windows-1252, Latin-1, or something very similar) to wide characters (probably UTF-16 or UCS-2 if sizeof(wchar_t) is 2, or UTF-32 if sizeof(wchar_t) is 4) and then to UTF-8. wcstombs, as Martin says, can handle the second part of this conversion, but it is system-dependent. However, Latin-1 corresponds to the first 256 Unicode code points, so converting from that source encoding can skip the wide-character step. Windows-1252 is close to Latin-1, but replaces some control characters (the range 80 to 9F) with printable characters.
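If the source really is Latin-1, the shortcut can be sketched as below. The name latin1_to_utf8 is hypothetical, and the code assumes ISO-8859-1 input; for Windows-1252, bytes in the range 80 to 9F would need a lookup table instead, which this sketch does not include.

```cpp
#include <string>

// Illustrative helper (hypothetical name). Assumes `latin1` is
// ISO-8859-1, where each byte value equals its Unicode code point,
// so every character needs at most two UTF-8 bytes.
std::string latin1_to_utf8(const std::string& latin1) {
    std::string out;
    out.reserve(latin1.size() * 2);  // worst case: every byte doubles
    for (unsigned char c : latin1) {
        if (c < 0x80) {
            out += static_cast<char>(c);                  // ASCII, unchanged
        } else {
            out += static_cast<char>(0xC0 | (c >> 6));    // lead byte: C2 or C3
            out += static_cast<char>(0x80 | (c & 0x3F));  // continuation byte
        }
    }
    return out;
}
```

With this, the Latin-1 bytes "bar \xE9" become the UTF-8 bytes "bar \xC3\xA9", which a byte-wise URL-escaping function would then turn into the %C3%A9 form that Twitter expects.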

+6

Fred nurk Feb 22 '11 at 20:54

Martin stone · Accepted Answer · 2011-02-22T19:57:25+0000

If you have a wide-character string, you can encode it in UTF-8 using the standard wcstombs() function. If you have it in some other encoding (e.g. Latin-1), you will first need to decode it to a wide string.

Edit: ... but wcstombs() depends on your locale settings, and it looks like you cannot select a UTF-8 locale on Windows. (You don't say which OS you are using.) WideCharToMultiByte() may be more useful on Windows, since you can specify the encoding in the call.
