PostgreSQL incorrectly sorts Unicode characters with Czech collation

I have a database whose collation is set to cs_CZ (Czech):

   Name    | Encoding |  Collation  |    CType
-----------+----------+-------------+-------------
 foo       | UTF8     | cs_CZ.UTF-8 | cs_CZ.UTF-8

but when I order by a string column, the result is not sorted as it should be according to the Czech alphabet:

 => SELECT surname FROM foo ORDER BY surname;
      surname
 -----------------
  A
  Da
  Ďb
  Dc
  E

So it is sorted as if the accented Unicode character (Ď) had been converted to its unaccented ASCII version (D). But the Czech alphabet goes ... C → D → Ď → E ..., so the returned order is incorrect (in this example it should be: A → Da → Dc → Ďb → E).

Is this the usual behavior of PostgreSQL? Is there a way to sort it correctly according to the Czech alphabet?

EDIT: Tried Postgres 9.1.4, both have the same behavior. This is an Arch Linux machine.
EDIT2: Corrected example, Ď is the real problem.

+6
3 answers

That is correct. The diacritics on á, ď, é, ě, í, ň, ó, ť, ú, ů, ý are ignored when sorting, see article

Czech sorting rules are a bit complicated :)

+4

PostgreSQL does not have its own sorting rules; it uses the rules provided by the operating system. If you run /usr/bin/sort with the same locale, you will get the same sort order.

Here is the result with sample data, tried on Ubuntu 12.04 with PostgreSQL 9.1:

 CREATE COLLATION cs_CZ (locale = 'cs_CZ.UTF-8');
 SELECT * FROM (values('Ca'),('Čb'),('Cc')) AS l(a) ORDER BY a COLLATE cs_CZ;

Result:

  a  
 ----
  Ca
  Cc
  Čb
 (3 rows)

Note that this is sorted the way you expect.

If your operating system sorts differently, and you are sure that its order is incorrect according to the official Czech rules, then that is a bug in the operating system's implementation of the Czech locale.

UPDATE following the comment:
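To check what the operating system's collation does independently of PostgreSQL, something like the following could be used. It is only a sketch: the helper name `sort_with_os_locale` is made up for illustration, it assumes the `cs_CZ.UTF-8` locale is installed on the machine, and it returns `None` instead of guessing when that locale is missing.

```python
import locale


def sort_with_os_locale(words, locale_name="cs_CZ.UTF-8"):
    """Sort words with the C library's collation for locale_name.

    Hypothetical helper: returns None when the locale is not installed.
    """
    # Remember the current collation setting so it can be restored.
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares
        # according to the active LC_COLLATE rules.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


if __name__ == "__main__":
    print(sort_with_os_locale(["E", "Dc", "Ďb", "Da", "A"]))
```

If the order printed here differs from what PostgreSQL returns on the same machine, the two are probably not using the same locale definition.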

  SELECT * FROM (values('A'),('Da'),('Ďb'),('Dc'),('E')) AS l(a) ORDER BY a COLLATE cs_CZ; 

leads to:

  a  
 ----
  A
  Da
  Ďb
  Dc
  E
+3

Sorting under the Czech collation does follow the Czech rules!

Characters such as á, ď, é, ě, í, ň, ó, ť, ú, ů, ý are sorted as if they had no diacritics, so the result is:

A, Da, Ďb, Dc, E, which is the correct order under the Czech rules.

To Slovak and Czech speakers this may seem crazy, but rules are rules.

Slovak (COLLATE sk_SK) has different rules: there d-ď, t-ť, n-ň and l-ľ are sorted as distinct letters in alphabetical order, unlike the Czech Ď.

+1

Source: https://habr.com/ru/post/946192/