Removing lines containing a unique first field with awk?

I'm looking to print only the lines whose first field occurs more than once. For example, from data that looks like this:

 1 abcd
 1 efgh
 2 ijkl
 3 mnop
 4 qrst
 4 uvwx

it should print:

 1 abcd
 1 efgh
 4 qrst
 4 uvwx

(FYI - the first field is not always 1 character in my data)

+4
5 answers
 awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile 

Yes, you give it the same file as input twice. Since you do not know in advance whether the current record is unique, you build an array keyed on $1 during the first pass, and then on the second pass you print only the records whose $1 was seen more than once.

I'm sure there are ways to do this in just one pass through the file, but I doubt they would be as clean (a sketch of one follows the proof of concept below).

Explanation

  • FNR==NR : This is true only while awk reads the first file. NR counts every record seen so far across all input files, while FNR restarts at 1 for each file, so the two are equal only during the first pass (the snippet after this list shows the two counters side by side).
  • a[$1]++ : builds an associative array a keyed on the first field ( $1 ); the value for a key is incremented each time that key is seen.
  • next : skip the rest of the script for this record and start over with the next one.
  • (a[$1] > 1) : evaluated only during the second pass over ./infile , it prints only those records whose first field ( $1 ) was seen more than once. A pattern with no action like this is shorthand for if(a[$1] > 1){print $0} .
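
To see the two counters side by side, here is a throwaway diagnostic of my own (not part of the answer); FNR equals NR only for the six records of the first pass:

 $ awk '{ print FILENAME, "NR=" NR, "FNR=" FNR }' ./infile ./infile
 ./infile NR=1 FNR=1
 ./infile NR=2 FNR=2
 ./infile NR=3 FNR=3
 ./infile NR=4 FNR=4
 ./infile NR=5 FNR=5
 ./infile NR=6 FNR=6
 ./infile NR=7 FNR=1
 ./infile NR=8 FNR=2
 ./infile NR=9 FNR=3
 ./infile NR=10 FNR=4
 ./infile NR=11 FNR=5
 ./infile NR=12 FNR=6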

Proof of concept

 $ cat ./infile
 1 abcd
 1 efgh
 2 ijkl
 3 mnop
 4 qrst
 4 uvwx
 $ awk 'FNR==NR{a[$1]++;next}(a[$1] > 1)' ./infile ./infile
 1 abcd
 1 efgh
 4 qrst
 4 uvwx
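
As for the one-pass idea mentioned above, here is a minimal sketch of my own (not from the original answer): it buffers the first line seen for each key, trading the second read of the file for memory proportional to the number of distinct keys.

 awk '{
     if (seen[$1]++) {            # key seen before: this record is a duplicate
         if (first[$1] != "") {   # flush the buffered first occurrence, once
             print first[$1]
             first[$1] = ""
         }
         print
     } else
         first[$1] = $0           # hold the first occurrence until a duplicate shows up
 }' ./infile

One caveat: when duplicates of different keys interleave, the output order can differ from the two-pass version, because a key's first line is only emitted once its duplicate arrives.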
+5

Here is some awk code to do what you want, assuming the input is already grouped by its first field (the same precondition uniq would need):

 BEGIN { f = ""; l = "" }
 {
     if ($1 == f) {
         if (l != "") {
             print l
             l = ""
         }
         print $0
     } else {
         f = $1
         l = $0
     }
 }

In this code, f is the previous value of field 1, and l is the first line of the group (or empty if it has already been printed).
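
Saved to a file (dup.awk is just a name chosen for this example), a run over the question's sample input looks like this:

 $ awk -f dup.awk ./infile
 1 abcd
 1 efgh
 4 qrst
 4 uvwx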

+1
 # Two-state machine: IDLE until the current first field repeats,
 # DUP while it keeps repeating. Assumes duplicates are adjacent.
 BEGIN { IDLE = 0; DUP = 1; state = IDLE }
 {
     if (state == IDLE) {
         if ($1 == lasttime) {
             state = DUP
             print lastline       # flush the held-back first line of the run
         } else
             state = IDLE
     } else {
         if ($1 != lasttime)
             state = IDLE         # the run of duplicates has ended
     }
     if (state == DUP)
         print $0
     lasttime = $1
     lastline = $0
 }
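
The DUP state is what keeps a run of three or more duplicates flowing. With the script saved as dupstate.awk (a name of my choosing), a quick check on such a run:

 $ printf '4 qrst\n4 uvwx\n4 asdf\n' | awk -f dupstate.awk
 4 qrst
 4 uvwx
 4 asdf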
+1

Assuming the ordered input you specify in your question:

 awk '$1 == prev {if (prevline) print prevline; print $0; prevline=""; next} {prev = $1; prevline=$0}' inputfile 

The file needs to be read only once.
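
For example, with inputfile holding the sample data from the question:

 $ awk '$1 == prev {if (prevline) print prevline; print $0; prevline=""; next} {prev = $1; prevline=$0}' inputfile
 1 abcd
 1 efgh
 4 qrst
 4 uvwx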

0

If you can use Ruby (1.9+):

 #!/usr/bin/env ruby
 # Group the rest of each line under its first field, then print
 # only the groups that collected more than one line.
 hash = Hash.new { |h, k| h[k] = [] }
 File.open("file").each do |x|
   a, b = x.split(/\s+/, 2)   # first field, remainder of the line
   hash[a] << b
 end
 hash.each { |k, v|
   hash[k].each { |y| puts "#{k} #{y}" } if v.size > 1
 }

output:

 $ cat file
 1 abcd
 1 efgh
 2 ijkl
 3 mnop
 4 qrst
 4 uvwx
 4 asdf
 1 xzzz
 $ ruby arrange.rb
 1 abcd
 1 efgh
 1 xzzz
 4 qrst
 4 uvwx
 4 asdf
0
