SQL does not distinguish between u and ü, although the sorting is utf8mb4_unicode_ci

Question

SQL does not distinguish between u and ü, although the sorting is utf8mb4_unicode_ci

Table x has a column with u and ü values.

SELECT * FROM x WHERE column='u' .

This returns u AND ü , although I am only looking for u .

Summary of utf8mb4_unicode_ci table. Wherever I read about such problems, everyone suggests using this sorting because they say that utf8mb4 really covers ALL CHARACTERS. With this sorting, it is necessary to solve all the problems of character set and sorting.

I can insert ü , è , é , à , Chinese characters , etc. When I do SELECT * , they are also extracted and displayed correctly.

The problem only occurs when compiling two rows, as in the above example ( SELECT WHERE ), or when I use UNIQUE INDEX in a column. When I use UNIQUE INDEX , a "ü" not inserted when I already had a "u" in the column. So, when SQL compares u and ü to decide whether ü is unique, it considers that it is the same as u and does not insert ü .

I changed everything to utf8mb4 because I no longer want to worry about character sets and sorting. However, it seems that utf8mb4 not a solution when it comes to COMPARATIVE strings.

I also tried: SELECT * FROM x WHERE _utf8mb4 'ü' COLLATE utf8mb4_unicode_ci = column .
This code is executable (looks pretty complicated). However, it also returns ü AND u .

I spoke with some people in India and here in China on this issue. We have not yet found a solution.

If anyone could solve this mystery, that would be great.

Add_On: after reading all the answers and comments below, here is an example of code that solves the problem:

SELECT * FROM x WHERE 'ü' COLLATE utf8mb4_bin = column

By adding "COLLATE utf8mb4_bin" to the SELECT query, SQL is suggested to put "binary glasses" (ending with _bin) when it looks at the characters in the column. In binary glasses, SQL now sees binary code in a column. And the binary code is different for each letter and symbol and emoji you can think of. Thus, SQL can now also see the difference between u and ü. So now it only returns ü when a SELECT query searches for ü and does not return u.

Thus, you can leave everything (sorting the database, matching tables) the same, but only add "COLLATE utf8mb4_bin" to the query when exact differentiation is required.

(Actually, SQL removes all other points (utf8mb4_german_ci, _general_ci, _unicode_ci, etc.) and does only what it does when it is not forced to do anything extra. It just looks at the binary code and does not correct its search in any special cultural context.)

Thanks to everyone for their support, especially from Pred.

+5

sql mysql utf-8 utf utf8mb4

Jakob Dec 12 '16 at 13:50

source share

4 answers

Pred · Answer 1 · 2016-12-12T15:31:12+0000

Sorting and character set are two different things.

A character set is simply an “unordered" list of characters and their representation. utf8mb4 is a character set and spans many characters.

Collation determines the order of the characters (determines the end result of the order, for example) and defines other rules (for example, which characters or combinations of characters should be considered the same). Collaborations are made from character sets; there can be more than one sort for the same character set. (This is a character set extension - sorta)

In utf8mb4_unicode_ci all (most?) utf8mb4_unicode_ci characters are treated as the same character, so you get u and ü . In short, this comparison is an asymmetric comparison with an accent.

This is similar to the fact that German comparisons consider ss and ß as such.

utf8mb4_bin is another sort, and it treats all characters as different. You may or may not want to use it by default, it is up to you and your business rules.

You can also convert sorting in queries, but keep in mind that this will prevent the use of MySQL indexes.

Here's an example using a similar, but perhaps a bit more familiar part of sorts:

ci at the end of the sort means Case Insensitive , and almost all comparisons to ci have a pair ending in cs , which means Case Sensitive .

When your column is case insensitive, the where column = 'foo' condition will find it all: foo Foo fOo FoO FOo FoO fOO, FOO.

Now, if you try to set case-sensitive matching (for example utf8mb4_unicode_cs ), all the above values are treated as different values.

Localized mappings (e.g. German, British, American, Hungarian, etc.) comply with named language rules. In Germany, ss and ß same, and this is stated in the rules of the German language. When a German user searches for the value of Straße , he expects that the software (supporting German or written in Germany) will return both Straße and Strasse .

To go further when it comes to ordering, two words are the same, they are equal, their meaning is the same, so there is no definite order.

Remember that a UNIQUE constraint is just a way to organize / filter values. Therefore, if there is a unique key defined in the column with German matching, it will not allow inserting both Straße and Strasse , since according to the rules of the language they should be considered as equal.

Now let's take a look at our original sort: utf8mb4_unicode_ci . This is a universal sort, which means that it tries to simplify everything, since ü not a very common character, and most users do not know how to enter it, this mapping makes it equal to u . This is a simplification to support most languages, but as you already know, these simplifications have some side effects. (e.g. organizing, filtering, using unique constraints, etc.).

utf8mb4_bin is the other end of the spectrum. This mapping should be as rigorous as possible. To achieve this, he literally uses character codes to distinguish between characters. This means that each character form is different, this sorting is implicitly case sensitive and accent sensitive.

Both of them have disadvantages: localized and general mappings are designed for one specific language or to provide a common solution. ( utf8mb4_unicode_ci is the "extension" of the old utf8_general_ci sort)

The binary requires special care when interacting with the user. Since these are cs and AS , this can confuse the users who are used to get the value “Foo” when they look for the value “foo”. Also as a developer, you have to be careful when it comes to joins and other functions. INNER JOIN 'foo' = 'Foo' will not return anything, since 'foo' is not equal to 'Foo'.

I hope these examples and explanations help a little.

Rick james · Answer 2 · 2016-12-13T20:05:07+0000

utf8_collations.html lists which letters are "equal" in different utf8 (or utf8mb4) commands. With rare exceptions, all accents are removed before comparison in any sorting ..._ci . Some of the exceptions are language specific, not Unicode at all. Example: Icelandic É > E

..._bin is the only collation that distinguishes letters with accented letters as different. The same goes for folding cases.

If you are comparing a lot, you should change the column sorting to ..._bin . When using the COLLATE in WHERE index cannot be used.

Note on ß . ss = ß almost all comparisons. In particular, utf8_general_ci (which was previously considered default) considered them to be unequal. This sorting made no effort to treat any two-letter combination ( ss ) as a single “letter”. Also, due to a bug in 5.0, utf8_general_mysql500_ci treats them unevenly.

Go utf8mb4_unicode_520_ci , utf8mb4_unicode_520_ci is the best version 5.7. For 8.0 utf8mb4_0900_ai_ci "better." "520" and "900" are Unicode standards, so there may be even more new ones in the future.

Veer Abheek Singh Manhas · Answer 3 · 2016-12-12T14:04:18+0000

You can try utf8_bin sorting, and you should not run into this problem, but it will be case sensitive. Comparison of beans is strictly compared, only separating characters according to the selected encoding, and once this is done, comparisons are performed on a binary basis, just as many programming languages will compare strings.

Sea coast of tibet · Answer 4 · 2017-02-21T14:39:51+0000

I will just add to the other answers that _bin sorting also has its own characteristics.

For example, after the following:

 CREATE TABLE `dummy` (`key` VARCHAR(255) NOT NULL UNIQUE); INSERT INTO `dummy` (`key`) VALUES ('one');

this will fail:

 INSERT INTO `dummy` (`key`) VALUES ('one ');

This is described in Binary Sort compared to _bin Collations .

Change I posted a related question here .

SQL does not distinguish between u and ü, although the sorting is utf8mb4_unicode_ci

More articles: