Using grep with a template file: printing single and duplicate entries

Question


Let me start by saying that I do not want to print only the duplicate lines, nor do I want to delete them.

I am trying to use grep with a template file to parse a large data file.

A template file, for example, might look like this:

    1243
    1234
    1234
    1234
    1354
    1356
    1356
    1677

and so on, with more single and duplicate entries.

The input file may look like this:

    aatta 1243 qqqqqq
    yyyyy 1234 vvvvvv
    ttttt 1555 bbbbbb
    ppppp 1354 pppppp
    yyyyy 3333 zzzzzz
    qqqqq 1677 eeeeee
    iiiii 4444 iiiiii

and so on, for 27,000 lines.

When I use

 grep -f 'Patternfile.txt' 'Inputfile.txt' > 'Outputfile.txt'

I get an output file that resembles this:

    aatta 1243 qqqqqq
    yyyyy 1234 vvvvvv
    ppppp 1354 pppppp

How could I make it also report duplicates, so that I get something like this?

    aatta 1243 qqqqqq
    yyyyy 1234 vvvvvv
    yyyyy 1234 vvvvvv
    yyyyy 1234 vvvvvv
    ppppp 1354 pppppp
    qqqqq 1677 zzzzzz

In addition, I would like to print an empty line whenever an entry in the template file does not match anything in the input file.

Thanks!

+4

grep line-breaks

Plutonicfriend Mar 26 '12 at 19:44


2 answers

You are not so much grepping for patterns as you are left-joining the data in the input file onto the template file.

You can (basically) do this with join, a handy Unix utility that I got to know well while trying to solve a problem similar to yours.

There are a few small differences from the output you asked for, described below.

First, the command:

 join -a 1 -2 2 <(sort Patternfile.txt) <(sort -k2,2 Inputfile.txt)

And the explanation:

-a 1 means also include unpairable lines from file 1 (Patternfile.txt). I added this because you wanted “empty” lines for patterns with no match, and this was the closest I could get.
-2 2 means join on field 2 of file 2 (you can set the join field with -1 FIELD and -2 FIELD; the default is field 1). This is because the key you are joining on in Inputfile.txt is in the second column.
<(sort Patternfile.txt) - the files must be sorted on the join field for the join to work properly.
<(sort -k2,2 Inputfile.txt) - sort the input file from key 2 to key 2, inclusive, that is, on the second field only.

Output:

    1234 yyyyy vvvvvv
    1234 yyyyy vvvvvv
    1234 yyyyy vvvvvv
    1243 aatta qqqqqq
    1354 ppppp pppppp
    1356
    1356
    1677 qqqqq eeeeee
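
To try the join approach end to end, here is a minimal sketch that rebuilds the sample files from the question in a temporary directory. It sorts into temporary files rather than using process substitution, so it should also run under a plain POSIX sh; the file names pattern.sorted and input.sorted and the variable joined are placeholders chosen for this example, and LC_ALL=C is an added assumption to keep sort and join on the same collation order.

```shell
# Sketch: reproduce the join-based approach on the question's sample data.
tmp=$(mktemp -d)

# Pattern file: one key per line, duplicates included.
printf '%s\n' 1243 1234 1234 1234 1354 1356 1356 1677 > "$tmp/Patternfile.txt"

# Input file: the key is the second whitespace-separated column.
cat > "$tmp/Inputfile.txt" <<'EOF'
aatta 1243 qqqqqq
yyyyy 1234 vvvvvv
ttttt 1555 bbbbbb
ppppp 1354 pppppp
yyyyy 3333 zzzzzz
qqqqq 1677 eeeeee
iiiii 4444 iiiiii
EOF

# join expects both inputs sorted on the join field, using the same collation.
LC_ALL=C sort "$tmp/Patternfile.txt" > "$tmp/pattern.sorted"
LC_ALL=C sort -k2,2 "$tmp/Inputfile.txt" > "$tmp/input.sorted"

# -a 1 keeps pattern lines without a partner; -2 2 joins on field 2 of file 2.
joined=$(LC_ALL=C join -a 1 -2 2 "$tmp/pattern.sorted" "$tmp/input.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

Each duplicate key in the pattern file pairs with the single matching input line, which is why the 1234 line is emitted three times.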

Differences

Slight differences between your desired result and this one:

It is sorted by key rather than in template-file order.
Unmatched lines still contain the original key. If that is a problem, you can blank out the unmatched lines by piping the output through a simple awk:
```
 ... | awk '{ if ($2 != "") print; else print "" }' 
```
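
As a quick check of that awk filter, the sketch below feeds it a few lines shaped like the join output shown earlier. The input lines are typed in by hand for illustration and the variable name cleaned is a placeholder; lines that consist of a key only have an empty second field and are replaced by an empty line.

```shell
# Sketch: run the awk filter on hand-written lines shaped like the join output.
cleaned=$(
  printf '%s\n' '1354 ppppp pppppp' '1356' '1356' '1677 qqqqq eeeeee' |
    awk '{ if ($2 != "") print; else print "" }'
)
printf '%s\n' "$cleaned"
```

The two key-only 1356 lines come out as blank lines, while the matched lines pass through unchanged.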

+1

Nicole 24 sept '13 at 20:56


Birei · Accepted Answer · 2012-03-26T20:09:59+0000

One solution, not with grep, but with perl:

Assuming patternfile.txt and inputfile.txt contain the data from your original post, the following content of script.pl should do the job. (I assume the string you want to compare is the second column; otherwise the script would have to be changed to use a regexp instead. This way is faster.)

    use warnings;
    use strict;

    ## Check arguments.
    die qq[Usage: perl $0 <pattern-file> <input-file>\n] unless @ARGV == 2;

    ## Open input files.
    open my $pattern_fh, qq[<], shift @ARGV or die qq[Cannot open pattern file\n];
    open my $input_fh, qq[<], shift @ARGV or die qq[Cannot open input file\n];

    ## Hash to save patterns.
    my (%pattern, %input);

    ## Read each pattern and save how many times appear in the file.
    while ( <$pattern_fh> ) {
        chomp;
        if ( exists $pattern{ $_ } ) {
            $pattern{ $_ }->[1]++;
        }
        else {
            $pattern{ $_ } = [ $., 1 ];
        }
    }

    ## Read file with data and save them in another hash.
    while ( <$input_fh> ) {
        chomp;
        my @f = split;
        $input{ $f[1] } = $_;
    }

    ## For each pattern, search it in the data file. If it appears, print line those
    ## many times saved previously, otherwise print a blank line.
    for my $p ( sort { $pattern{ $a }->[0] <=> $pattern{ $b }->[0] } keys %pattern ) {
        if ( $input{ $p } ) {
            printf qq[%s\n], $input{ $p } for ( 1 .. $pattern{ $p }->[1] );
        }
        else {
            # Old behaviour.
            # printf qq[\n];

            # New requirement.
            printf qq[\n] for ( 1 .. $pattern{ $p }->[1] );
        }
    }

Run it like this:

 perl script.pl patternfile.txt inputfile.txt

It produces the following output (the blank lines printed for the two unmatched 1356 entries are not shown here):

    aatta 1243 qqqqqq
    yyyyy 1234 vvvvvv
    yyyyy 1234 vvvvvv
    yyyyy 1234 vvvvvv
    ppppp 1354 pppppp
    qqqqq 1677 eeeeee
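
For comparison, the same lookup idea can be sketched as an awk one-liner. This is not the script above but a swapped-in alternative under the same assumption that the key is the second column of the input file: it stores each input line under its key, then walks the pattern file line by line and prints either the stored line or an empty line. Unlike the Perl script, which groups repeated patterns at their first occurrence, this sketch follows the pattern file strictly in order. The array name line and the variable result are placeholders for this example.

```shell
# Sketch: order-preserving lookup with awk on the question's sample data.
tmp=$(mktemp -d)

printf '%s\n' 1243 1234 1234 1234 1354 1356 1356 1677 > "$tmp/Patternfile.txt"

cat > "$tmp/Inputfile.txt" <<'EOF'
aatta 1243 qqqqqq
yyyyy 1234 vvvvvv
ttttt 1555 bbbbbb
ppppp 1354 pppppp
yyyyy 3333 zzzzzz
qqqqq 1677 eeeeee
iiiii 4444 iiiiii
EOF

# First file (input): remember each line under its second field.
# Second file (patterns): print the remembered line, or an empty line if none.
# Note: the NR == FNR test assumes the input file is not empty.
result=$(awk '
  NR == FNR { line[$2] = $0; next }
  { print (($1 in line) ? line[$1] : "") }
' "$tmp/Inputfile.txt" "$tmp/Patternfile.txt")
printf '%s\n' "$result"

rm -rf "$tmp"
```

If the same key appears more than once in the input file, the last occurrence wins, which matches how the Perl script fills its %input hash.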
