Why is UTF-8 text sorted in a different order between OS X and Linux?

I have a text file with lines of UTF-8 encoded text:

    mac-os-x$ cat unsorted.txt
    ウ
    foo
    チ
    'foo'
    津

In case it helps to reproduce the problem, here are the checksum and a dump of the exact bytes in the file, as well as a way to generate the file yourself (on Linux, use base64 -d instead of -D):

    mac-os-x$ shasum unsorted.txt
    a6d0b708d3e0cafb0c6e1af7450e9243da8cb078 unsorted.txt
    mac-os-x$ perl -ne 'print join(" ", map { sprintf "%02x", ord } split //), "\n"' unsorted.txt
    e3 82 a6 0a
    66 6f 6f 0a
    e3 83 81 0a
    27 66 6f 6f 27 0a
    e6 b4 a5 0a
    mac-os-x$ echo 44KmCmZvbwrjg4EKJ2ZvbycK5rSlCg== | base64 -D > unsorted.txt

When I sort this file on Mac OS X (whether I use the GNU sort 5.93 that ships with OS X Yosemite or the GNU sort 8.23 installed via Homebrew), I get this result:

    mac-os-x$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /usr/bin/sort unsorted.txt
    'foo'
    foo
    ウ
    チ
    津
    mac-os-x$ echo `sw_vers -productName` `sw_vers -productVersion`
    Mac OS X 10.10.1
    mac-os-x$ /usr/bin/sort --version | head -1
    sort (GNU coreutils) 5.93

When I sort the same file with the same locale settings on Linux (I tested both CentOS 5.5 and CentOS 6.5), I get a different result:

    linux-centos-6.5$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /bin/sort unsorted.txt
    ウ
    チ
    foo
    'foo'
    津
    linux-centos-6.5$ cat /etc/redhat-release
    CentOS release 6.5 (Final)
    linux-centos-6.5$ /bin/sort --version | head -1
    sort (GNU coreutils) 8.4

Note the different placement of the Japanese kana relative to the ASCII lines, and the different order of the two lines that differ only by the single quotes.

To add another data point, I noticed that a very old FreeBSD 6 box gives the same sort order as OS X:

    freebsd-6.0$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /usr/bin/sort unsorted.txt
    'foo'
    foo
    ウ
    チ
    津
    freebsd-6.0$ uname -rs
    FreeBSD 6.0-RELEASE
    freebsd-6.0$ sort --version | head -1
    sort (GNU coreutils) 5.3.0-20040812-FreeBSD

I expected the sort order to be the same in every case, given that all three systems use GNU sort with the same locale settings. I also tried setting LC_COLLATE explicitly, including LC_COLLATE=C to force sorting by byte order, but the results did not change.

Why is my example input sorted differently on OS X and Linux? And how can I get both systems to produce identically sorted text (I don't care which order, as long as it is consistent between them)?

1 answer

It seems that sort on your Linux machine does not follow the correct collation order for UTF-8 text.

Unicode code points (hex) of the first character of each line in your unsorted.txt:

ウ - 30A6

foo - 0066

チ - 30C1

'foo' - 0027

津 - 6D25

taken from http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=%E3%82%A6&mode=char
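The same code points can also be checked locally instead of through a web form. A minimal sketch, assuming a POSIX shell with `iconv`, `od`, `tr` and `cut` available; the `codepoint` helper is a hypothetical name introduced here, not something from the question:

```shell
# Hypothetical helper: print the code point (8 hex digits) of the first
# character of its argument. Re-encoding to UTF-32BE makes every character
# exactly four bytes, so the first 8 hex digits are the first code point.
codepoint() {
    printf '%s' "$1" \
        | iconv -f UTF-8 -t UTF-32BE \
        | od -An -tx1 \
        | tr -d '[:space:]' \
        | cut -c1-8
}

# The five lines of unsorted.txt from the question.
for line in 'ウ' 'foo' 'チ' "'foo'" '津'; do
    printf '%s - %s\n' "$line" "$(codepoint "$line")"
done
```

The values are zero-padded to eight digits, so 30A6 shows up as 000030a6.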

So the correct order, according to the Unicode Collation Algorithm (UCA) table ( http://www.unicode.org/Public/UCA/latest/allkeys.txt ), is:

'foo' - line 487

foo - line 8966

ウ - line 20875

チ - line 21004

津 - not in file

So, to answer your question: your Linux machine gives the sort function the wrong collation tables. Unfortunately, I can't say why that is.
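As for getting identical output on both systems, one option is to bypass the locale tables entirely and sort by raw bytes. A minimal sketch, assuming a POSIX shell and that plain byte order is acceptable. `LC_ALL` takes precedence over both `LC_COLLATE` and `LANG`, so if `LC_ALL=en_US.utf-8` was still set when `LC_COLLATE=C` was tried, that may be why it had no effect (an assumption, since the question does not show that command). The scratch directory and variable names are only illustrative:

```shell
# Recreate the five-line input from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'ウ' 'foo' 'チ' "'foo'" '津' > "$workdir/unsorted.txt"

# LC_ALL=C overrides LANG and every other LC_* variable for this one
# command, so sort compares raw bytes instead of using locale tables.
sorted=$(LC_ALL=C sort "$workdir/unsorted.txt")
printf '%s\n' "$sorted"
```

For valid UTF-8, byte order is the same as code point order, so this gives 'foo', foo, ウ, チ, 津, which matches the OS X and FreeBSD output in the question. It is not a linguistically meaningful order, only a reproducible one.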

PS: There is a similar question to yours here.

EDIT

As @ninjalj noted, glibc does not use UCA; it uses ISO 14651 instead. This bug report suggests switching to UCA. Unfortunately, it is still unresolved.

In addition, this may be related to the issue of ls being case-insensitive on Mac OS X. Some people even suggest that it has something to do with the HFS file system.


Source: https://habr.com/ru/post/979380/

