I have a text file containing lines of UTF-8 encoded text:
    mac-os-x$ cat unsorted.txt
    ウ
    foo
    チ
    'foo'
    津
In case it helps to reproduce the problem, here are the checksum and a dump of the exact bytes in the file, as well as a command to generate the file yourself (on Linux, use base64 -d instead of -D):
    mac-os-x$ shasum unsorted.txt
    a6d0b708d3e0cafb0c6e1af7450e9243da8cb078 unsorted.txt
    mac-os-x$ perl -ne 'print join(" ", map { sprintf "%02x", ord } split //), "\n"' unsorted.txt
    e3 82 a6 0a
    66 6f 6f 0a
    e3 83 81 0a
    27 66 6f 6f 27 0a
    e6 b4 a5 0a
    mac-os-x$ echo 44KmCmZvbwrjg4EKJ2ZvbycK5rSlCg== | base64 -D > unsorted.txt
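Since the base64 flag differs between platforms, a platform-neutral way to recreate the file may be useful. The following is only a sketch in Python, not part of my original commands: the payload and checksum are copied from the transcript above, while the names `decode_sample`, `hex_dump`, and `write_sample` are made up for illustration.

```python
import base64
import hashlib

# Base64 payload and SHA-1 checksum copied from the shell transcript.
ENCODED = "44KmCmZvbwrjg4EKJ2ZvbycK5rSlCg=="
EXPECTED_SHA1 = "a6d0b708d3e0cafb0c6e1af7450e9243da8cb078"


def decode_sample():
    """Return the exact bytes of unsorted.txt."""
    return base64.b64decode(ENCODED)


def hex_dump(data):
    """Mimic the Perl one-liner: one row of hex bytes per input line."""
    return [
        " ".join("%02x" % b for b in line)
        for line in data.splitlines(keepends=True)
    ]


def write_sample(path):
    """Write the sample bytes to `path` (hypothetical helper)."""
    with open(path, "wb") as fh:
        fh.write(decode_sample())


if __name__ == "__main__":
    data = decode_sample()
    print("\n".join(hex_dump(data)))
    # Compare against the checksum reported by shasum.
    print("sha1 matches:", hashlib.sha1(data).hexdigest() == EXPECTED_SHA1)
```

Calling `write_sample("unsorted.txt")` should produce the same file as the `echo ... | base64` pipeline, on either system.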
When I sort this input file on Mac OS X (regardless of whether I use the GNU sort 5.93 that ships with OS X Yosemite or the GNU sort 8.23 installed via Homebrew), I get this sorted result:
    mac-os-x$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /usr/bin/sort unsorted.txt
    'foo'
    foo
    ウ
    チ
    津
    mac-os-x$ echo `sw_vers -productName` `sw_vers -productVersion`
    Mac OS X 10.10.1
    mac-os-x$ /usr/bin/sort --version | head -1
    sort (GNU coreutils) 5.93
When I sort the same file with the same locale settings on Linux (I tested both CentOS 5.5 and CentOS 6.5), I get a different result:
    linux-centos-6.5$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /bin/sort unsorted.txt
    ウ
    チ
    foo
    'foo'
    津
    linux-centos-6.5$ cat /etc/redhat-release
    CentOS release 6.5 (Final)
    linux-centos-6.5$ /bin/sort
Note two differences: the Japanese kana are placed differently relative to the English text, and the two lines that differ only by single quotes sort in the opposite order.
To add another data point, I noticed that on a very old FreeBSD 6 box I get the same sort order as on OS X:
    freebsd-6.0$ env -i LANG=en_US.utf-8 LC_ALL=en_US.utf-8 /usr/bin/sort unsorted.txt
    'foo'
    foo
    ウ
    チ
    津
    freebsd-6.0$ uname -rs
    FreeBSD 6.0-RELEASE
    freebsd-6.0$ sort --version | head -1
    sort (GNU coreutils) 5.3.0-20040812-FreeBSD
I expected the sort order to be the same in every case, given that all of them use GNU sort with the same locale settings. I tried setting LC_COLLATE explicitly, and I tried LC_COLLATE=C to force sorting by byte order, but neither changed the results.
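As a reference point for what "byte order" means for this input, here is a small sketch that sorts the five sample lines by their raw UTF-8 bytes. It is written in Python so that it does not depend on sort or on any locale setting. The names `LINES` and `byte_order` are hypothetical and chosen only for this illustration.

```python
# The five sample lines, in the order they appear in unsorted.txt.
LINES = ["ウ", "foo", "チ", "'foo'", "津"]


def byte_order(lines):
    """Sort by raw UTF-8 bytes, ignoring any locale collation rules."""
    return sorted(lines, key=lambda s: s.encode("utf-8"))


if __name__ == "__main__":
    for line in byte_order(LINES):
        print(line)
```

This prints 'foo', foo, ウ, チ, 津 (one per line), which is the same order I get on OS X and FreeBSD. So it is the Linux result that departs from plain byte order.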
Why is my example input file sorted differently on OS X and Linux? And how can I get both systems to produce identically sorted output (I don't care which order they use, as long as it is consistent between them)?