MySQL CHAR() function and UTF-8 output?

 +--------------------------+--------------------------------------------------------+
 | Variable_name            | Value                                                  |
 +--------------------------+--------------------------------------------------------+
 | character_set_client     | utf8                                                   |
 | character_set_connection | utf8                                                   |
 | character_set_database   | utf8                                                   |
 | character_set_filesystem | binary                                                 |
 | character_set_results    | utf8                                                   |
 | character_set_server     | utf8                                                   |
 | character_set_system     | utf8                                                   |
 | character_sets_dir       | /usr/local/mysql-5.1.41-osx10.5-x86_64/share/charsets/ |
 +--------------------------+--------------------------------------------------------+
 8 rows in set (0.00 sec)

 mysql> select version();
 +-----------+
 | version() |
 +-----------+
 | 5.1.41    |
 +-----------+
 1 row in set (0.00 sec)

 mysql> select char(0x00FC);
 +--------------+
 | char(0x00FC) |
 +--------------+
 | ?            |
 +--------------+
 1 row in set (0.00 sec)

I expected the actual UTF-8 character "ü" instead of "?". I tried char(0x00FC using utf8), but that did not work either.

I am using MySQL version 5.1.41.

I have searched Google and cannot find anything on this. The MySQL docs only say that multibyte output is expected for values greater than 255 as of MySQL 5.0.14.

thanks

2 answers

You are confusing UTF-8 with Unicode.

0x00FC is the Unicode code point for ü:

 mysql> select char(0x00FC using ucs2);
 +-------------------------+
 | char(0x00FC using ucs2) |
 +-------------------------+
 | ü                       |
 +-------------------------+

In UTF-8 encoding, the code point 0x00FC is represented by two bytes, 0xC3 0xBC:

 mysql> select char(0xC3BC using utf8);
 +-------------------------+
 | char(0xC3BC using utf8) |
 +-------------------------+
 | ü                       |
 +-------------------------+
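The difference between the code point and its encoded bytes can also be checked outside MySQL. The following is a minimal Python sketch, not part of the original answer; the variable names are illustrative. It assumes big-endian UTF-16 as a stand-in for MySQL's ucs2, which holds for BMP characters such as ü. The last step shows that a lone 0xFC byte is not valid UTF-8, which is consistent with (though not proof of) the "?" seen in the question.

```python
# Code point U+00FC (ü) versus its encoded byte sequences.
code_point = 0x00FC
char = chr(code_point)

# UCS-2 / UTF-16 (big-endian): the bytes equal the code point value.
utf16_bytes = char.encode("utf-16-be")

# UTF-8: the same character is encoded as two different bytes.
utf8_bytes = char.encode("utf-8")

# A lone 0xFC byte is not valid UTF-8, so a lenient decoder
# substitutes the replacement character U+FFFD.
lone_byte = bytes([0xFC]).decode("utf-8", errors="replace")

print(char, utf16_bytes.hex(), utf8_bytes.hex())
```

Here `utf16_bytes.hex()` is `00fc` and `utf8_bytes.hex()` is `c3bc`, matching the two hex literals used in the queries above.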

UTF-8 is just one way of encoding Unicode characters in binary form. It is designed to be space-efficient, so ASCII characters take only one byte, and ISO-8859-1 characters such as ü take only two bytes. Some other characters take three or four bytes, but they are much less common.
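As a rough illustration of those byte counts, here is a short Python sketch. The sample characters are chosen for this example and do not come from the original answer.

```python
# UTF-8 byte length per character; the sample characters are illustrative.
samples = ["A", "ü", "€", "👾"]

# Map each character to the number of bytes in its UTF-8 encoding.
lengths = {ch: len(ch.encode("utf-8")) for ch in samples}

for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} {ch}: {n} byte(s)")
```

The four samples take 1, 2, 3, and 4 bytes respectively.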


To add to Martin's answer:

  1. You can use a "character set introducer" instead of the CHAR() function. To do so, specify the encoding, prefixed with an underscore, before the hex value:

     _utf16 0xFC 

    or:

     _utf16 0x00FC 
  2. If the goal is to specify a code point rather than an encoded byte sequence, you need an encoding in which the code point value is itself the encoded byte sequence. For example, as shown in Martin's answer, 0x00FC is both the code point value for ü and its encoded byte sequence in ucs2 / utf16 (the two are effectively the same encoding for BMP characters, but I prefer "utf16" because it is consistent with "utf8" and "utf32", in keeping with the "utf" theme).

    But utf16 only works this way for BMP characters (code points U+0000 - U+FFFF). If you need a supplementary character, you must use the utf32 encoding. Not only does _utf32 0xFC return ü, but also:

     _utf32 0x1F47E 

    returns: 👾
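The reason utf32 works for supplementary characters while utf16 does not can be seen by comparing the encoded bytes. This is a small Python sketch under the assumption that big-endian UTF-32 and UTF-16 correspond to MySQL's utf32 and utf16 character sets; the variable names are illustrative.

```python
# Supplementary character U+1F47E, outside the BMP.
code_point = 0x1F47E
char = chr(code_point)

# UTF-32 (big-endian): the four bytes are the code point value itself.
utf32_bytes = char.encode("utf-32-be")

# UTF-16 (big-endian): a surrogate pair, which no longer matches
# the code point value.
utf16_bytes = char.encode("utf-16-be")

print(utf32_bytes.hex(), utf16_bytes.hex())
```

The UTF-32 bytes are `0001f47e`, equal to the code point, whereas the UTF-16 bytes are the surrogate pair `d83ddc7e`.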

For more information about these options, as well as Unicode escape sequences for other languages and platforms, please see my post:

Unicode Escape Sequences Across Various Languages and Platforms (including Supplementary Characters)


Source: https://habr.com/ru/post/1303184/

