Difference between the utf8 Unicode and Danish collations

Hello. I am moving the database from latin1_swedish_ci to utf8. I have always used utf8_danish_ci because, as far as I know, it is the closest match to Norwegian ordering. But what about utf8_general_ci and utf8_unicode_ci?

Some time ago, the advice was to use _general_ci for speed and _unicode_ci for accuracy, because the sorting algorithm of the latter is more complex. But since performance is no longer a problem, or at least not much of a problem in most cases, is _unicode_ci the normal choice in most situations?

But how does _unicode_ci differ from _danish_ci?
Is it only the last three letters of the Nordic alphabet, æ, ø and å, that it takes into account?

Most of the comparisons I can find only set _general_ci against _unicode_ci.

If anyone knows of examples of when to use _unicode_ci and when to use _danish_ci, that would be much appreciated.

+4
4 answers

In short, if your application is multilingual and stores multiple languages in the same tables, you are mostly stuck and have to handle sorting / collation outside the database. In that case utf8_general_ci is as good as any other.

If it supports only one language, you will do best by setting the correct collation at the database level. In your case utf8_danish_ci is the right one, since Danish ordering is the same as Norwegian, if Wikipedia is anything to go by.

If you want to know more about collation, the ICU docs have vivid examples of how thorny this stuff is. Quoting at length:

http://userguide.icu-project.org/collation

[H]ere are some of the ways languages vary in ordering strings:

The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".

Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".

Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated as equivalent to "e".

Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts just after "Z".

Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.

A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".

Thai requires that the order of certain letters be reversed.

French requires that letters sorted with accents at the end of the string be sorted ahead of accents at the beginning of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".

Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are usually sorted before uppercase letters in English. Latvian letters are the exact opposite.

Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.

Sorting orders can change over time due to government regulations or new characters / scripts in Unicode.
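
To make the "combination of letters as a single letter" case from the quote concrete, here is a minimal Python sketch. It is not how ICU or MySQL implement collation; `make_sort_key` and `SPANISH_TRADITIONAL` are hypothetical names, and the alphabet is deliberately simplified to the plain a-z letters plus "ch".

```python
def make_sort_key(alphabet):
    """Build a sort key from an ordered list of collation elements.

    Elements may be longer than one character (e.g. "ch"). The longest
    match wins, so "ch" is read as one letter rather than "c" + "h".
    """
    rank = {element: i for i, element in enumerate(alphabet)}
    longest = max(len(element) for element in alphabet)

    def key(word):
        word = word.lower()
        ranks, pos = [], 0
        while pos < len(word):
            for size in range(longest, 0, -1):
                chunk = word[pos:pos + size]
                if chunk in rank:
                    ranks.append(rank[chunk])
                    pos += len(chunk)
                    break
            else:
                # Unknown character: sort it after the whole alphabet.
                ranks.append(len(alphabet) + ord(word[pos]))
                pos += 1
        return ranks

    return key


# Simplified assumption: a-z with "ch" as its own letter between "c" and "d".
SPANISH_TRADITIONAL = list("abc") + ["ch"] + list("defghijklmnopqrstuvwxyz")

words = ["dama", "chico", "cuna"]
by_code_point = sorted(words)
by_traditional_spanish = sorted(words, key=make_sort_key(SPANISH_TRADITIONAL))
```

With a plain code point sort "chico" comes before "cuna"; with the tailored key every word starting with "c" sorts before any word starting with "ch".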

+5

Please remember that collation != encoding.

An encoding is a mapping between integers (which is all that a database can store at the end of the day) and human-readable graphical representations of characters.

Collation is an ordering rule used to sort characters according to the normal alphabetical order of a given language. Note that this ordering does not reflect the actual order of the internal, numerical representation.
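
A small Python sketch of the distinction, using the three letters from the question. The `DANISH_TAIL` string is an assumption made for illustration: it ranks only æ, ø and å and is not a full collation.

```python
letters = ["æ", "ø", "å"]  # Danish/Norwegian alphabet order

# Encoding: the same character maps to different bytes in different encodings.
encoded = {ch: (ch.encode("latin-1"), ch.encode("utf-8")) for ch in letters}

# Sorting plain strings compares code points and ignores the alphabet:
# å (U+00E5) < æ (U+00E6) < ø (U+00F8).
by_code_point = sorted(letters)

# Collation is a separate rule. Here it is just the position in a
# hand-written string (assumption: only these three letters are ranked).
DANISH_TAIL = "æøå"
by_alphabet = sorted(letters, key=DANISH_TAIL.index)
```

The stored bytes change with the encoding, but neither byte order nor code point order gives the alphabetical order æ, ø, å. That order has to come from the collation.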

Your question boils down to the following: what alphabetical order should you use in your application? This cannot be answered.

+2

I'm not 100% sure, but I believe utf8_danish_ci is a subset of the utf8 (Unicode) collation.

So, if your database is utf8 encoded, there is no point in using the Danish collation.

Quick test (as I am in a hurry and cannot find the collation chart for utf8_unicode):

  • create a table containing all these characters (both lower and upper case) with the collation utf8_danish_ci
  • select all records sorted by char ASC
  • change the collation of the table to utf8_general_ci or, preferably, utf8_unicode_ci and run the same query again
  • if the characters come back in the same order in both queries, choose either; it does not matter
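
The steps above can be sketched without a MySQL server by using Python's built-in sqlite3 module. This is only an analogue: SQLite stands in for MySQL, `danish_like` is a hypothetical hand-written collation and not MySQL's utf8_danish_ci, SQLite's built-in BINARY collation plays the role of the second collation, and only lowercase letters are inserted to keep the output unambiguous.

```python
import sqlite3

# Hypothetical stand-in for a Danish/Norwegian collation: a-z, then æ, ø, å.
ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"


def danish_like(left, right):
    """Three-way comparison (-1, 0 or 1), ignoring case like a *_ci collation."""
    def key(text):
        return [ALPHABET.find(ch) for ch in text.lower()]

    a, b = key(left), key(right)
    return (a > b) - (a < b)


conn = sqlite3.connect(":memory:")
conn.create_collation("danish_like", danish_like)

# Step 1: a table of test characters with the language-specific collation.
conn.execute("CREATE TABLE letters (ch TEXT COLLATE danish_like)")
conn.executemany("INSERT INTO letters VALUES (?)", [(c,) for c in "zåaøæ"])

# Step 2: select everything, sorted by the column's own collation.
danish_order = [row[0] for row in conn.execute(
    "SELECT ch FROM letters ORDER BY ch ASC")]

# Step 3: the same rows under a different collation.
binary_order = [row[0] for row in conn.execute(
    "SELECT ch FROM letters ORDER BY ch COLLATE BINARY ASC")]

# Step 4: if both orders match, the choice does not matter for this data.
same_order = danish_order == binary_order
```

For this data the two orders differ in where å ends up, so `same_order` is false and the choice of collation does matter.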

Related link.


UPDATE

My hypothesis was wrong.

I ran some tests and apparently utf8_unicode_ci does not sort in the same order, so never mind.

-1

A collation belongs to a character set: the character set determines which characters can be stored in a table, and the collation determines how they are compared and ordered. Anything starting with utf8 should cover most character storage needs, so utf8_general_ci is a good default choice. If you intend to focus on one language, you can choose a language-specific collation such as utf8_danish_ci, which means that the order will be Danish and case-insensitive (the ci part).

For a multilingual application, you can store fields using utf8_general_ci and, when you need sorting or comparison for a specific language, add a COLLATE clause to your query with your preferred collation from https://dev.mysql.com/doc/refman/5.6/en/charset-unicode-sets.html
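
Here is a rough sketch of choosing the collation per query, again with Python's sqlite3 module standing in for MySQL. The collations `danish_like` and `accent_folded` and the helper `names_ordered_by` are hypothetical names written for this example; in MySQL the query would instead end in something like ORDER BY name COLLATE utf8_danish_ci.

```python
import sqlite3
import unicodedata

DANISH = "abcdefghijklmnopqrstuvwxyzæøå"


def three_way(a, b):
    return (a > b) - (a < b)


def danish_like(left, right):
    # Rank letters by their position in the Danish/Norwegian alphabet.
    def rank(text):
        text = unicodedata.normalize("NFC", text).lower()
        return [DANISH.find(ch) for ch in text]

    return three_way(rank(left), rank(right))


def accent_folded(left, right):
    # Strip combining marks so that "å" compares like "a".
    def fold(text):
        decomposed = unicodedata.normalize("NFD", text.lower())
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

    return three_way(fold(left), fold(right))


COLLATIONS = {"danish_like": danish_like, "accent_folded": accent_folded}

conn = sqlite3.connect(":memory:")
for name, func in COLLATIONS.items():
    conn.create_collation(name, func)

# The column itself carries no language-specific collation.
conn.execute("CREATE TABLE people (name TEXT)")
conn.executemany("INSERT INTO people VALUES (?)", [("Åse",), ("Bo",), ("Zoe",)])


def names_ordered_by(collation):
    # The name is interpolated into SQL, so accept only registered collations.
    if collation not in COLLATIONS:
        raise ValueError(f"unknown collation: {collation}")
    query = f"SELECT name FROM people ORDER BY name COLLATE {collation}"
    return [row[0] for row in conn.execute(query)]
```

The same stored rows come back as Bo, Zoe, Åse under `danish_like` and as Åse, Bo, Zoe under `accent_folded`, which is the point of picking the collation at query time.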

The answer provided by @Denis above, claiming that you cannot sort in MySQL, is wrong in my experience.

-1

Source: https://habr.com/ru/post/1480659/

