How to get MySQL to process SQLite-like strings regarding Unicode and sorting?

Question

How to get MySQL to process SQLite-like strings regarding Unicode and sorting?

I have been studying this question for several hours, on SO, in MySQL docs and elsewhere, but still cannot find a satisfactory solution. The problem is this:

What is the easiest way to get MySQL to process strings in the same way as SQLite, without any extra smart conversions?

For example, in SQLite, the following works fine:

CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE); INSERT INTO `dummy` (`key`) VALUES ('one'); INSERT INTO `dummy` (`key`) VALUES ('one '); INSERT INTO `dummy` (`key`) VALUES ('One'); INSERT INTO `dummy` (`key`) VALUES ('öne'); SELECT * FROM `dummy`;

However, in MySQL with the following settings:

 [client] default-character-set = utf8mb4 [mysql] default-character-set = utf8mb4 [mysqld] character-set-client-handshake = FALSE character-set-server = utf8mb4 collation-server = utf8mb4_bin

and the following CREATE DATABASE statement:

 CREATE DATABASE `dummydb` DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_bin;

It still does not work in the second INSERT .

I would rather simplify the declarations of string columns as much as possible, so SQLite TEXT is ideal. It sounds like VARBINARY is the way to go , but I would still like to hear your opinion about any other, potentially better options .

Addendum : SHOW CREATE TABLE dummy output

 mysql> SHOW CREATE TABLE dummy; +-------+----------------------------------------------------- | Table | Create Table +-------+----------------------------------------------------- | dummy | CREATE TABLE `dummy` ( `key` varchar(255) COLLATE utf8mb4_bin NOT NULL, UNIQUE KEY `key` (`key`) ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin | +-------+----------------------------------------------------- 1 row in set (0.00 sec)

+2

mysql unicode collation

Sea coast of tibet Feb 21 '17 at 15:18

source share

2 answers

The approach shown in the question should (basically) work very well in MySQL for the following reasons:

Mapping (not to be confused with encoding) is a set or rules that determine how to sort and compare characters commonly used for replication at the database level of user expectations in terms of culture (if I'm looking for cafe I also expect to find a café ).
Sorting plays an important rule over unique constraints, because it establishes a definition of unique.
Binary mappings are specifically designed to ignore cultural rules and work at the byte level, so utf8mb4_bin is the right choice here.
MySQL allows you to establish a combination of coding and collation with column-level granularity.
If there is no sorting in the column definition, it will use a level one table.
If there is no sorting in the table definition, it will use database level one.
If there is no sorting in the database definition, it will use server level one.

It is also worth noting that MySQL will convert between encodings transparently if:

Correct connection coding established.
The conversion is physically possible (for example, all source characters also refer to target encoding).

For this last reason, VARBINARY may not be the best choice for a column that is still text because it opens the door to get the café stored from a connection configured to use ISO-8859-1 and cannot retrieve it correctly from the connection configured to use UTF-8.

Side Note: The table definition shown may cause the following error:

ERROR 1071 (42000): The specified key was too long; The maximum key length is 767 bytes.

Indexes can have a relatively small maximum size. From docs :

If innodb_large_prefix is enabled (by default), the index key prefix limit is 3072 bytes for InnoDB tables that use DYNAMIC or COMPRESSED row format. If innodb_large_prefix is disabled, the limit key index prefix is 767 bytes for tables of any row format.
innodb_large_prefix is deprecated and will be removed in a future release. innodb_large_prefix was introduced in MySQL 5.5 to disable large key-key prefixes for compatibility with earlier versions of InnoDB that do not support large-index prefixes.
The index key prefix length limit is 767 bytes for InnoDB tables that use the REDUNDANT or COMPACT format. For example, you can click this limit with a column prefix index of more than 255 characters per TEXT or VARCHAR, assuming a character set of utf8mb3 and a maximum of 3 bytes for each character.
Attempting to use an index key prefix length that exceeds the limit returns an error. To avoid such errors in replication configurations, do not enable innodb_large_prefix on the master device, if it also cannot be activated on slaves.

Since utf8_mb8 allocates 4 bytes per character, the 767 limit will be full of only 192 characters.

We have one more problem:

 mysql> CREATE TABLE `dummy` ( -> `key` varchar(191) COLLATE utf8mb4_bin NOT NULL, -> UNIQUE KEY `key` (`key`) -> ) -> ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_bin; Query OK, 0 rows affected (0.01 sec) mysql> INSERT INTO `dummy` (`key`) VALUES ('one'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO `dummy` (`key`) VALUES ('one '); ERROR 1062 (23000): Duplicate entry 'one ' for key 'key'

Sorry?

 mysql> INSERT INTO `dummy` (`key`) VALUES ('One'); Query OK, 1 row affected (0.00 sec) mysql> INSERT INTO `dummy` (`key`) VALUES ('öne'); Query OK, 1 row affected (0.00 sec) mysql> SELECT * FROM `dummy`; +-----+ | key | +-----+ | One | | one | | öne | +-----+ 3 rows in set (0.00 sec)

This last issue is the interesting subtlety of MySQL mappings. From docs :

All MySQL mappings are of type PADSPACE. This means that all CHAR, VARCHAR, and TEXT values in MySQL are compared without any trailing spaces . The “comparison” in this context does not include the LIKE-template operator for which trailing spaces are significant
[...] For cases where the characters of the closing pad are separated or comparisons ignore them, if the column has an index that requires a unique value, inserting into the column values that differ only in the number of characters at the end of the line will lead to an error with a duplicate key .

I would dare say that the VARBINARY type is the only way to overcome this ...

+1

Álvaro González Feb 22 '17 at 8:48

source share

Rick james · Accepted Answer · 2017-02-21T17:09:40+0000

MySQL wants to convert strings when executing INSERT and SELECT . The conversion occurs between what you declare to the client and what is declared in the column.

The only way to avoid this is: VARBINARY and BLOB instead of VARCHAR and TEXT .

Using COLLATION utf8mb4_bin does not preclude conversion to / from CHARACTER SET utf8mb4 ; he simply says that WHERE and ORDER BY should compare bits instead of dealing with accents and folding phrases.

Keep in mind that CHARACTER SET utf8mb4 is a way to encode text; COLLATION utf8mb4_* - these are the rules for comparing texts in this encoding. _bin is simple.

UNIQUE involves comparison for equality, therefore, COLLATION . In most calculations, utf8mb4 3 (without spaces) are compared equal. utf8mb4_bin will treat 3 as different. utf8mb4_hungarian_ci considers one = one> öne.

Trailing spaces are controlled by the column data type ( VARCHAR or otherwise). The latest version even has a parameter related to the consideration of trailing spaces.

How to get MySQL to process SQLite-like strings regarding Unicode and sorting?

More articles: