What is the difference between the following three sort commands in unix?

How are the following sorting commands in unix different?

1) sort -k1,4 < file
2) sort -k1,1 -k4,4 < file
3) sort -k1,1 -k2,2 -k3,3 -k4,4 < file

In particular, #1 and #2 confuse me. The following example illustrates my point:

    $ cat tmp
    1 2 3 t
    4 2 4 c
    5 4 6 c
    7 3 20 r
    12 3 5 i
    2 45 7 a
    11 23 53 b
    23 43 53 q
    11 6 3 c
    0 4 3 z
    $ diff <(sort -k1,4 tmp) <(sort -k1,1 -k2,2 -k3,3 -k4,4 tmp)
    1a2
    > 1 2 3 t
    5,6d5
    < 1 2 3 t
    < 23 43 53 q
    7a7
    > 23 43 53 q
    $ diff <(sort -k1,4 tmp) <(sort -k1,1 -k4,4 tmp)
    1a2
    > 1 2 3 t
    5,6d5
    < 1 2 3 t
    < 23 43 53 q
    7a7
    > 23 43 53 q

I also looked at the sort man page, which says:

    -k, --key=POS1[,POS2]
           start a key at POS1 (origin 1), end it at POS2 (default end of line)

But I do not understand this explanation. If a key starts at POS1 and ends at POS2, shouldn't commands #1 and #3 give the same results?

1 answer

The difference is that #1 treats fields 1 through 4 (here, the entire line) as a single key and sorts on it lexicographically. The other two use several keys, and although #3 covers the same fields as #1, it handles them very differently. It first sorts by the first field; if two or more rows have the same value there, it uses the second key to order that subset of rows; if rows are identical in the first two fields, it uses the third key, and so on. Note that the blanks preceding a field count as part of that field, which matters if you do not specify -b.
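To see how the tie-breaking keys change the outcome, here is a small sketch. The three rows are made-up sample data (not from the question), chosen so that field 1 ties and fields 2 and 4 disagree about the order. LC_ALL=C is set only to keep the result independent of the local collation rules.

```shell
# Hypothetical sample data: two rows tie on field 1 ("x"),
# and fields 2 and 4 would order those rows differently.
data='x 2 0 a
x 1 0 b
w 9 9 z'

# Command 2 from the question: sort by field 1, break ties with field 4.
by_1_then_4=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -k4,4)

# Command 3 from the question: sort by field 1, then fields 2, 3 and 4 in turn.
by_each_field=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -k2,2 -k3,3 -k4,4)

printf 'sort -k1,1 -k4,4:\n%s\n\n' "$by_1_then_4"
printf 'sort -k1,1 -k2,2 -k3,3 -k4,4:\n%s\n' "$by_each_field"
```

With #2 the "x" rows are ordered by their last field (a before b), while with #3 the second field decides first (1 before 2), so the two "x" rows swap places.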

In your first case, the result may depend on your locale (try LC_ALL=C sort -k1,4 < file and compare it with, for example, LC_ALL=en_US.utf8 sort -k1,4 < file).
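A quick way to check this on a given machine is sketched below, using two rows from the question's data. The locale name en_US.UTF-8 is an assumption: it may be spelled differently or not be installed on your system, in which case sort typically falls back to C-like ordering. Whether the two results differ depends on the collation tables of the installed C library, so no particular outcome is claimed for the non-C locale.

```shell
# Two rows taken from the question's sample data.
data='12 3 5 i
1 2 3 t'

# Assumption: example locale name; adjust to one listed by `locale -a`.
other_locale='en_US.UTF-8'

# In the C locale the key is compared byte by byte, so the space
# (0x20) sorts before any digit.
c_order=$(printf '%s\n' "$data" | env LC_ALL=C sort -k1,4)

# env hands the variable to sort only, not to the calling shell.
other_order=$(printf '%s\n' "$data" | env LC_ALL="$other_locale" sort -k1,4)

printf 'C locale:\n%s\n\n%s:\n%s\n\n' "$c_order" "$other_locale" "$other_order"

if [ "$c_order" = "$other_order" ]; then
    echo 'Same order in both locales on this system.'
else
    echo 'The locale changed the order.'
fi
```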

In your second and third cases, fields are split at each transition from a non-blank character to a blank. This means that the 2nd and subsequent fields carry leading blanks of varying width, and these affect the sort order because you do not specify -b.
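The effect of those leading blanks can be sketched with a right-aligned second column. The rows are made-up sample data, and LC_ALL=C is used so that the comparison is plain byte order.

```shell
# Hypothetical sample data with a right-aligned second column:
# field 2 is "  cc", "   z" and "  bb" including its leading blanks.
data='b  cc
a   z
a  bb'

# Without -b the leading blanks are part of field 2, and a space
# sorts before any letter, so "   z" comes before "  bb".
with_blanks=$(printf '%s\n' "$data" | LC_ALL=C sort -k1,1 -k2,2)

# With -b (given before the first -k) leading blanks are skipped,
# so field 2 is compared as "z" versus "bb".
without_blanks=$(printf '%s\n' "$data" | LC_ALL=C sort -b -k1,1 -k2,2)

printf 'sort -k1,1 -k2,2:\n%s\n\n' "$with_blanks"
printf 'sort -b -k1,1 -k2,2:\n%s\n' "$without_blanks"
```

Without -b the "a" rows come out as "a   z" then "a  bb"; with -b they come out as "a  bb" then "a   z".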

Also, if you have a combination of spaces and tabs to align your columns, this can mess things up.

I got the same results as you with LC_ALL=en_US.utf8 in my environment, but the expected results (i.e. no differences) with LC_ALL=C (SuSE Enterprise 11.2).


Source: https://habr.com/ru/post/946536/

