Is it safe to use a utf8mb4 connection with utf8 columns?

I have some MySQL tables with utf8mb4 columns and others with utf8 columns.

Is it safe to use utf8mb4 in the PDO connection string for all tables? Or do I need to convert everything to utf8mb4 or run two different PDO connections?


EDIT: The question is not "can I store 4-byte characters in utf8 columns?" We already know that we cannot, and that this does not depend on the connection. A column is utf8 precisely because it will never receive 4-byte characters: country or currency codes, email addresses, usernames, and so on, where the application validates the input.

+11
3 answers

This can be easily verified using the following script:

    <?php
    // No charset in the DSN; SET NAMES below switches it for each test.
    $pdo = new PDO('mysql:host=localhost;dbname=test', 'test', '');

    // One column per character set under test.
    $pdo->exec("
        drop table if exists utf8_test;
        create table utf8_test(
            conn varchar(50) collate ascii_bin,
            column_latin1 varchar(50) collate latin1_general_ci,
            column_utf8 varchar(50) collate utf8_unicode_ci,
            column_utf8mb4 varchar(50) collate utf8mb4_unicode_ci
        );
    ");

    $latin = 'abc ÀŒé';  // characters available in MySQL's latin1
    $utf8 = '♔♕';        // 3-byte UTF-8 characters
    $mb4 = '🛃 🔣';       // 4-byte UTF-8 characters

    // First row: written over a utf8 connection.
    $pdo->exec("set names utf8");
    $pdo->exec("
        insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
        values ('utf8', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
    ");

    // Second row: written over a utf8mb4 connection.
    $pdo->exec("set names utf8mb4");
    $pdo->exec("
        insert into utf8_test(conn, column_latin1, column_utf8, column_utf8mb4)
        values ('utf8mb4', '$latin', '$latin $utf8', '$latin $utf8 $mb4')
    ");

    // Read both rows back over the utf8mb4 connection.
    $result = $pdo->query('select * from utf8_test')->fetchAll(PDO::FETCH_ASSOC);
    var_export($result);

And here is the result:

    array (
      0 =>
      array (
        'conn' => 'utf8',
        'column_latin1' => 'abc ÀŒé',
        'column_utf8' => 'abc ÀŒé ♔♕',
        'column_utf8mb4' => 'abc ÀŒé ♔♕ ???? ????',
      ),
      1 =>
      array (
        'conn' => 'utf8mb4',
        'column_latin1' => 'abc ÀŒé',
        'column_utf8' => 'abc ÀŒé ♔♕',
        'column_utf8mb4' => 'abc ÀŒé ♔♕ 🛃 🔣',
      ),
    )

As you can see, we cannot use utf8 as the connection encoding when working with utf8mb4 columns (note the ???? in the first row). But we can use a utf8mb4 connection when working with utf8 columns. There are also no problems writing to and reading from latin1 or ascii columns.

The reason is that any utf8, latin1, or ascii character can be encoded in utf8mb4, but not vice versa. Therefore, using utf8mb4 as the connection character set is safe in this case.
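For the PDO side of the question, the connection character set can be requested in the DSN itself through the charset parameter, which PDO_MYSQL honors since PHP 5.3.6. A minimal sketch follows; the helper name buildDsn and the host, database, and credentials are placeholders for illustration, not part of the test above.

```php
<?php
// Hypothetical helper: builds a PDO MySQL DSN that requests a given
// connection character set (utf8mb4 unless told otherwise).
function buildDsn(string $host, string $dbname, string $charset = 'utf8mb4'): string
{
    return sprintf('mysql:host=%s;dbname=%s;charset=%s', $host, $dbname, $charset);
}

$dsn = buildDsn('localhost', 'test');
echo $dsn, "\n";

// With real credentials, the same DSN serves utf8 and utf8mb4 tables alike:
// $pdo = new PDO($dsn, 'test', '');
```

With the charset in the DSN, a separate "set names utf8mb4" call after connecting should not be needed.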

+3

Short answer: NO, it is not safe.

If your data contains 4-byte (utf8mb4-only) characters and you use MySQL's utf8 charset for the connection, you will run into problems, because MySQL's utf8 charset supports only BMP characters (at most 3 bytes per character).

My recommendation is to convert all tables to utf8mb4 for full UTF-8 support. In addition, utf8mb4 is backward compatible with utf8.
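One way to carry out such a conversion is MySQL's ALTER TABLE ... CONVERT TO CHARACTER SET statement. The sketch below only generates the statements so they can be reviewed first. The helper name buildConvertStatements, the example table names, and the choice of utf8mb4_unicode_ci as the collation are assumptions made for illustration.

```php
<?php
// Hypothetical helper: returns one ALTER TABLE statement per table name.
// Table names are assumed to come from a trusted list, not from user input.
function buildConvertStatements(array $tables, string $collation = 'utf8mb4_unicode_ci'): array
{
    $statements = [];
    foreach ($tables as $table) {
        // Backtick-quote the identifier, doubling any embedded backtick.
        $quoted = '`' . str_replace('`', '``', $table) . '`';
        $statements[] = "ALTER TABLE $quoted CONVERT TO CHARACTER SET utf8mb4 COLLATE $collation";
    }
    return $statements;
}

// Made-up table names; print the statements instead of executing them.
foreach (buildConvertStatements(['users', 'countries']) as $sql) {
    echo $sql, ";\n";
}
```

The conversion rewrites each table, so it is worth trying on a copy of the data before touching production.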

+2

Short answer: Yes, if you use only 3-byte (or shorter) UTF-8 characters.

Or ... No, if you are going to work with 4-byte UTF-8 characters such as 😅😘.

Long answer:

(And I will explain why "no" can be the right answer.)

The connection sets which encoding the client uses.

The CHARACTER SET of a column (or, by default, of its table) determines which encoding can be stored in that column.

CHARACTER SET utf8 is a subset of utf8mb4. That is, every character acceptable to utf8 (whether on the connection or in a column) is acceptable to utf8mb4. Put the other way around, MySQL's utf8mb4 (the same as the outside world's UTF-8) implements the full 4-byte UTF-8 encoding, which covers more emoji, more Chinese, and so on than MySQL's up-to-3-byte utf8 (that is, the BMP only).

(Technically, utf8mb4 handles only up to 4 bytes per character, whereas the original UTF-8 design allowed longer sequences. RFC 3629 has since capped UTF-8 at 4 bytes, so I doubt that 5-byte characters will appear in my lifetime.)
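To make the byte counts concrete, here is a small PHP sketch that reports the widest UTF-8 character in a string. The helper name maxBytesPerChar is made up for this illustration. The code assumes PHP 7 or later (for the \u{...} escapes) and valid UTF-8 input.

```php
<?php
// Hypothetical helper: length in bytes of the widest UTF-8 character in $s.
// Returns 0 for an empty string or for input that is not valid UTF-8.
function maxBytesPerChar(string $s): int
{
    // The /u modifier makes preg_split cut between characters, not bytes.
    $chars = preg_split('//u', $s, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false || $chars === []) {
        return 0;
    }
    return max(array_map('strlen', $chars));
}

echo maxBytesPerChar('abc'), "\n";        // ASCII only: 1
echo maxBytesPerChar("\u{00E9}"), "\n";   // e with acute accent: 2
echo maxBytesPerChar("\u{2654}"), "\n";   // white chess king, in the BMP: 3
echo maxBytesPerChar("\u{1F605}"), "\n";  // emoji outside the BMP: 4
```

A result of 3 or less means the string fits MySQL's utf8; a result of 4 means it needs utf8mb4.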

So, here is what happens to any 3-byte (or shorter) UTF-8 character in the client, given that the connection is utf8mb4 and the table columns are only utf8: each character goes into and comes out of the server without conversion and without errors. Note: when a problem does occur, it occurs on INSERT, not SELECT; however, you may not notice it until you run a SELECT.

But what if the client sends an emoji? Now you will get an error (or a truncated string, or question marks). That is because a 4-byte emoji cannot be squeezed into a 3-byte utf8 column (or a 1-byte latin1 column, and so on).
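Since the question says the application validates input destined for utf8 columns, one possible check is to reject any string containing a character outside the BMP before it reaches an INSERT. This is only a sketch: the function name fitsMysqlUtf8 is invented here, and it assumes PHP 7 or later.

```php
<?php
// Hypothetical helper: true if $s is valid UTF-8 and contains no code point
// above U+FFFF, so every character fits MySQL's 3-byte utf8.
function fitsMysqlUtf8(string $s): bool
{
    // preg_match returns 1 if a supplementary-plane character is found,
    // 0 if none is found, and false if $s is not valid UTF-8.
    return preg_match('/[\x{10000}-\x{10FFFF}]/u', $s) === 0;
}

var_dump(fitsMysqlUtf8("abc \u{2654}"));   // 3-byte character only
var_dump(fitsMysqlUtf8("ok \u{1F605}"));   // contains a 4-byte emoji
```

Whether to reject such input or strip the offending characters is an application decision; the check only identifies them.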

If you are using MySQL 5.5 or 5.6, you may run into the 767-byte index limit (which works out to 191 utf8mb4 characters). I give some workarounds here. None is perfect.
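The two numbers are related by simple arithmetic, assuming 767 refers to InnoDB's 767-byte limit on an index key prefix in those versions: dividing by the worst-case bytes per character gives the longest fully indexable column. The helper name below is made up for the illustration.

```php
<?php
// Hypothetical helper: how many characters fit in an index of $limitBytes
// when every character may need up to $bytesPerChar bytes.
function maxIndexableChars(int $limitBytes, int $bytesPerChar): int
{
    return intdiv($limitBytes, $bytesPerChar);
}

echo maxIndexableChars(767, 3), "\n"; // utf8:    255
echo maxIndexableChars(767, 4), "\n"; // utf8mb4: 191
```

This is why an indexed VARCHAR(255) column that was fine in utf8 can fail after conversion to utf8mb4, while VARCHAR(191) still fits.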

As for the inverse (utf8 connection, but utf8mb4 columns): SELECT can have trouble if the table contains any 4-byte characters.

"Official sources"? Good luck. I have spent a decade trying to untangle the intricacies of character handling and then boiling them down to practical advice. Much of that time I thought I had all the answers, only to run into yet another failing test case. The common cases are listed in "Trouble with UTF-8 characters; what I see is not what I stored". However, that does not directly address your question!

From the comment

    mysql> SHOW CREATE TABLE emoji\G
    *************************** 1. row ***************************
           Table: emoji
    Create Table: CREATE TABLE `emoji` (
      `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
      `text` varchar(255) NOT NULL,
      PRIMARY KEY (`id`)
    ) ENGINE=InnoDB AUTO_INCREMENT=5 DEFAULT CHARSET=utf8mb4
    1 row in set (0.00 sec)

    mysql> insert into emoji (text) values ("abc");
    Query OK, 1 row affected (0.01 sec)

    mysql> show variables like 'char%';
    +--------------------------+----------------------------+
    | Variable_name            | Value                      |
    +--------------------------+----------------------------+
    | character_set_client     | utf8                       |
    | character_set_connection | utf8                       |
    | character_set_database   | utf8mb4                    |
    | character_set_filesystem | binary                     |
    | character_set_results    | utf8                       |
    | character_set_server     | utf8mb4                    |
    | character_set_system     | utf8                       |
    | character_sets_dir       | /usr/share/mysql/charsets/ |
    +--------------------------+----------------------------+
    8 rows in set (0.00 sec)

The output above shows that the "connection" (think "client") is using utf8, not utf8mb4.

    mysql> insert into emoji (text) values ("😅😘"); -- 4-byte Emoji
    Query OK, 1 row affected, 1 warning (0.00 sec)

    mysql> show warnings;
    +---------+------+----------------------------------------------------------------------------------+
    | Level   | Code | Message                                                                          |
    +---------+------+----------------------------------------------------------------------------------+
    | Warning | 1366 | Incorrect string value: '\xF0\x9F\x98\x85\xF0\x9F...' for column 'text' at row 1 |
    +---------+------+----------------------------------------------------------------------------------+
    1 row in set (0.00 sec)

Now change the "connection" to utf8mb4:

    mysql> SET NAMES utf8mb4;
    Query OK, 0 rows affected (0.00 sec)

    mysql> insert into emoji (text) values ("😅😘");
    Query OK, 1 row affected (0.01 sec)

    mysql> SELECT * FROM emoji;
    +----+--------------+
    | id | text         |
    +----+--------------+
    |  1 | ? ? ? ?      |
    |  2 | abc          |
    |  3 | ???????????? |  -- from when "utf8" was in use
    |  4 | 😅😘         |  -- Success with utf8mb4 in use
    +----+--------------+
    4 rows in set (0.01 sec)

+2

Source: https://habr.com/ru/post/1239812/

