Linux join utility complains that input file is not sorted

Question

I have two files:

file1 has the format:

field1;field2;field3;field4

(file1 is not initially sorted)

file2 has the format:

 field1

(file2 is sorted)

I run the following two commands:

 sort -t\; -k1 file1 -o file1 # to sort file1
 join -t\; -1 1 -2 1 -o 1.1 1.2 1.3 1.4 file1 file2

I get the following message:

 join: file1:27497: is not sorted: line_which_was_identified_as_out_of_order

Why is this happening?

(I also tried sorting file1 on the entire line, not only on the first field of the line, but without success.)

sort -t\; -c file1 does not output anything. Around line 27497, the situation is really strange, which suggests that sort is not doing its job correctly:

               XYZ113017;...
 line 27497--> XYZ11301;...
               XYZ11301;...

sorting linux join bash text-processing

Razvan Aug 21 '14 at 16:48

2 answers

Wumpus Q. Wumbley · Answer 1 · 2014-08-21T17:10:13+0000

sort -k1 uses all fields starting from field 1 as the key. You need to specify a stop field.

 sort -t\; -k1,1
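
To tie this back to the question, here is a minimal sketch with made-up sample rows. Only the XYZ11301 and XYZ113017 keys come from the question; the other values and the temporary directory are assumptions for illustration. It also forces the C locale so that sort and join agree on byte-wise ordering, which may or may not suit your real data.

```shell
#!/usr/bin/env bash
# Hypothetical sample data; only the XYZ11301 / XYZ113017 keys echo the question.
export LC_ALL=C   # assumption: byte-wise collation, shared by sort and join

tmpdir=$(mktemp -d)
printf '%s\n' 'XYZ113017;a;b;c' 'XYZ11301;d;e;f' 'ABC1;g;h;i' > "$tmpdir/file1"
printf '%s\n' 'ABC1' 'XYZ11301' > "$tmpdir/file2"

# Sort on field 1 ONLY (start and stop field), then join on field 1 of both files.
sort -t';' -k1,1 -o "$tmpdir/file1" "$tmpdir/file1"
joined=$(join -t';' -1 1 -2 1 -o 1.1,1.2,1.3,1.4 "$tmpdir/file1" "$tmpdir/file2")
printf '%s\n' "$joined"

rm -r "$tmpdir"
```

With -k1 instead of -k1,1, the XYZ113017 line would sort ahead of XYZ11301 in this locale, because 7 has a lower byte value than ; and the separator becomes part of the comparison. join may then report the file as not sorted.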

mklement0 · Answer 2 · 2015-03-31T18:17:14+0000

To complement Wumpus Q. Wumbley's helpful answer with a broader perspective (since I found this post while researching a slightly different problem):

When using join, the input files must be sorted by the join field ONLY; otherwise you may see the warning reported by the OP.

There are two common scenarios in which more than the field of interest is mistakenly included when sorting the input files:

If you do specify a field, it is easy to forget that you must also specify a stop field, even if you target only one field, because sort uses the rest of the line if only a start field is specified; e.g.:
- sort -t, -k1 ... # !! FROM field 1 THROUGH THE REST OF THE LINE
- sort -t, -k1,1 ... # Field 1 only
If your sort field is the FIRST field in the input, it is tempting not to specify any field selector at all.
- However, if field values can be prefix substrings of each other, sorting whole lines does NOT (necessarily) result in the same sort order as sorting by the 1st field only:
- sort ... # NOT always the same as 'sort -k1,1'! See below for an example.

Pitfall example:

 #!/usr/bin/env bash

 # Input data: fields separated by '^'.
 # Note that, when properly sorting by field 1, the order should
 # be "nameA" before "nameAA" (followed by "nameZ").
 # Note how "nameA" is a substring of "nameAA".
 read -r -d '' input <<EOF
 nameA^other1
 nameAA^other2
 nameZ^other3
 EOF

 # NOTE: "WRONG" below refers to deviation from the expected outcome
 #       of sorting by field 1 only, based on mistaken assumptions.
 #       The commands do work correctly in a technical sense.

 echo '--- just sort'
 sort <<<"$input" | head -1 # WRONG: 'nameAA' comes first

 echo '--- sort FROM field 1'
 sort -t^ -k1 <<<"$input" | head -1 # WRONG: 'nameAA' comes first

 echo '--- sort with field 1 ONLY'
 sort -t^ -k1,1 <<<"$input" | head -1 # ok, 'nameA' comes first

Explanation:

If you do NOT restrict sorting to the first field, it is the relative sort order of the characters ^ and A (at column index 6) that matters in this example. In other words, the field separator is compared with data, which is the source of the problem: ^ has a HIGHER ASCII value than A and therefore sorts after it, so the line starting with nameAA^ sorts BEFORE the one starting with nameA^.
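
As a quick sanity check of the byte values involved, the following sketch prints the ASCII codes of the two characters being compared. The variable names are made up for illustration; the leading-quote form of printf's %d argument is standard shell behavior.

```shell
#!/usr/bin/env bash
# A leading single quote makes printf's %d print the character's numeric code.
caret=$(printf '%d' "'^")    # the field separator used in the example
letter=$(printf '%d' "'A")   # the data character it ends up being compared with
printf '^ = %s, A = %s\n' "$caret" "$letter"
```

Since 94 is greater than 65, a byte-wise comparison of nameA^... against nameAA^... places the nameAA^ line first.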
Note: It is possible for problems to surface on one platform but be masked on another, depending on locale and character-set settings and/or the sort implementation used; for example, with locale en_US.UTF-8 in effect, with , as the separator and - allowed inside fields:
- sort as used on OSX 10.10.2 (an old version of GNU sort, 5.93) sorts , before - (in line with ASCII values)
- sort as used on Ubuntu 14.04 (GNU sort 8.21) does the opposite: it sorts - before , [1]

[1] I don't know why; if someone knows, please tell me. Test with sort <<<$'-\n,'
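
One common way to sidestep such platform differences, assuming plain byte-value ordering is acceptable for your data, is to force the C locale for the sort (and for the subsequent join, so that both tools agree). A minimal sketch using the same two test characters:

```shell
#!/usr/bin/env bash
# Assumption: byte-wise (C locale) ordering is acceptable for the data at hand.
# In the C locale, ',' (0x2C) sorts before '-' (0x2D).
c_order=$(printf '%s\n' '-' ',' | LC_ALL=C sort | tr '\n' ' ')
printf 'C locale order: %s\n' "$c_order"
```

If you sort this way, run join under the same locale setting; mixing locales between the two commands can reintroduce the "is not sorted" complaint.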
