Counting equal lines in two files

Question


Let's say I have two files and want to know how many lines they have in common. For example, file1 is

 1
 3
 2
 4
 5
 0
 10

and file2 contains

 3
 10
 5
 64
 15

In this case, the answer should be 3 (common lines: "3", "10" and "5").

This is, of course, quite easy to do in Python, for example, but I was curious about doing it from bash (with some standard utilities, or extras like awk). Here is what I came up with:

  cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l

This seems too complicated for the task, so I wonder if there is an easier or more elegant way to achieve the same result.

P.S. Printing the number of common lines as a fraction of the total number of lines in each file would also be nice, although this is not necessary.

UPD: Files do not have duplicate lines

+6

bash awk

mikhail Aug 13 '14 at 10:02


7 answers

With your example input, this works too. But if the files are huge, I would prefer the awk solutions:

 grep -cFwf file2 file1

With your input files, the above outputs 3.
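One thing to keep in mind: -w matches whole words, so the pattern 3 would also match a line such as "3 apples". If exact line equality is wanted, -x (match the whole line) may be a closer fit. A minimal sketch, assuming the question's sample data is recreated in a temporary directory; the function name count_common_grep is made up for illustration:

```shell
# Hypothetical helper: count the lines of $1 that are exactly equal
# to some line of $2.
# -c count matches, -x match whole lines, -F fixed strings (no regex),
# -f read the patterns from a file.
count_common_grep() {
    grep -cxFf "$2" "$1"
}

# Sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 > "$tmp/file1"
printf '%s\n' 3 10 5 64 15 > "$tmp/file2"

count_common_grep "$tmp/file1" "$tmp/file2"
```

For the sample files this prints 3, the same as the -w version, because every line there is a single word.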

+2

Kent Aug 13 '14 at 10:18


Here is one without awk, which uses comm instead:

 comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l

comm compares two sorted files line by line. The options -1 and -2 suppress the lines unique to file1 and file2 respectively, so the output is just the lines the two files have in common. wc -l then counts those lines.

Output without wc -l :

 10
 3
 5

And with the count (obviously), the result is 3.
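The same comm approach can be stretched to cover the percentages asked for in the P.S. What follows is only a sketch: the function name common_report and the output format are made up for illustration, and temporary files are used in place of process substitution so that it also runs in a plain POSIX shell.

```shell
# Hypothetical helper: print the number of common lines, then that number
# as a percentage of each file's line count.
common_report() {
    dir=$(mktemp -d)
    sort "$1" > "$dir/a"
    sort "$2" > "$dir/b"
    common=$(comm -12 "$dir/a" "$dir/b" | wc -l)
    n1=$(wc -l < "$1")
    n2=$(wc -l < "$2")
    rm -r "$dir"
    # LC_ALL=C keeps the decimal point a "." regardless of locale.
    LC_ALL=C awk -v c="$common" -v n1="$n1" -v n2="$n2" 'BEGIN {
        printf "common: %d\n", c
        printf "file1: %.1f%%\n", (n1 ? 100 * c / n1 : 0)
        printf "file2: %.1f%%\n", (n2 ? 100 * c / n2 : 0)
    }'
}

# Sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 > "$tmp/file1"
printf '%s\n' 3 10 5 64 15 > "$tmp/file2"

common_report "$tmp/file1" "$tmp/file2"
```

With the question's files this reports 3 common lines, 42.9% of file1 and 60.0% of file2.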

+1

keyser Aug 13 '14 at 10:20


You can do everything with awk:

 awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2

To get the percentage, something like this works:

 awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2

and displays

 file1 0.428571
 file2 0.6

In your solution, you can get rid of the cat:

 sort file1 file2 | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l

0

martin Aug 13 '14 at 10:10


You can also use the comm command. Remember that you will have to sort the files you need to compare first:

 [gc@slave ~]$ sort a > sorted_1
 [gc@slave ~]$ sort b > sorted_2
 [gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
 10
 3
 5

From the man page for comm:

 comm - compare two sorted files line by line

Options:

 -1 suppress column 1 (lines unique to FILE1)
 -2 suppress column 2 (lines unique to FILE2)
 -3 suppress column 3 (lines that appear in both files)

0

Technext Aug 13 '14 at 10:21


How about keeping it nice and simple...

This is all you need:

 cat file1 file2 | sort -n | uniq -d | wc -l
 3

man sort: -n, --numeric-sort - compare according to string numerical value

man uniq: -d, --repeated - only print duplicate lines, one for each group

man wc: -l, --lines - print the newline counts

Hope this helps.

EDIT - one process fewer (credit: martin):

 sort file1 file2 | uniq -d | wc -l
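This relies on the question's guarantee that neither file contains duplicate lines; a line repeated inside one file would otherwise be reported by uniq -d even if the other file does not contain it. A hedged sketch of one way around that is to deduplicate each file with sort -u first (the function name count_common_dedup is made up for illustration):

```shell
# Hypothetical helper: deduplicate each file on its own, so a line can only
# appear twice in the merged stream if it occurs in both files.
count_common_dedup() {
    { sort -u "$1"; sort -u "$2"; } | sort | uniq -d | wc -l | tr -d ' '
}

# Made-up sample: "7" is repeated inside file1 but absent from file2.
tmp=$(mktemp -d)
printf '%s\n' 7 7 > "$tmp/file1"
printf '%s\n' 8 > "$tmp/file2"

sort "$tmp/file1" "$tmp/file2" | uniq -d | wc -l   # counts the repeated 7
count_common_dedup "$tmp/file1" "$tmp/file2"       # does not count it
```

For this sample the plain pipeline reports 1 and the deduplicating version reports 0. The price is two extra sort processes, so for files known to be duplicate-free the one-liner above remains the simpler choice.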

0

mattst Aug 13 '14 at 10:25


One way, using awk:

 awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2

Output: 3
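For readers unfamiliar with the idiom: NR == FNR is only true while awk is reading the first file, so the first rule stores file1's lines and the second rule runs only on file2. A commented sketch of the same one-liner follows; the function name count_common_awk is made up, and the n + 0 is a small addition so that 0 is printed instead of an empty line when nothing matches.

```shell
# Hypothetical wrapper around the one-liner above, spread over several lines.
count_common_awk() {
    awk 'NR == FNR { seen[$0]; next }   # first file: remember each line
         $0 in seen { n++ }             # second file: count lines seen before
         END { print n + 0 }            # n + 0 prints 0 rather than nothing
    ' "$1" "$2"
}

# Sample data from the question.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 > "$tmp/file1"
printf '%s\n' 3 10 5 64 15 > "$tmp/file2"

count_common_awk "$tmp/file1" "$tmp/file2"
```

Only the first file is held in memory, which is worth considering when one of the two files is much larger than the other.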

0

John b Aug 13 '14 at 11:58


Aserre · Accepted Answer · 2014-08-13T10:11:58+0000

To find lines common to your 2 files using awk:

 awk 'a[$0]++' file1 file2

Output:

 3
 10
 5

Now just pipe this to wc to get the number of common lines:

 awk 'a[$0]++' file1 file2 | wc -l

Output: 3.

Please note that this solution counts duplicates, i.e. if you have:

 file1 | file2
 1     | 3
 2     | 3
 3     | 3

awk 'a[$0]++' file1 file2 will print 3 three times, and awk 'a[$0]++' file1 file2 | wc -l will print 3.
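If duplicates can occur and each common line should be counted only once, one possible variation (a sketch, not part of the answer above; the function name count_distinct_common is made up) is to record file1's lines first and count each matching line of file2 only the first time it shows up:

```shell
# Hypothetical helper: count distinct lines that occur in both files.
count_distinct_common() {
    awk 'NR == FNR { seen[$0]; next }                 # remember lines of file1
         ($0 in seen) && !counted[$0]++ { n++ }       # first match per line only
         END { print n + 0 }
    ' "$1" "$2"
}

# The duplicate example from above.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 > "$tmp/file1"
printf '%s\n' 3 3 3 > "$tmp/file2"

count_distinct_common "$tmp/file1" "$tmp/file2"
```

For the duplicate example this prints 1 rather than 3. It also avoids counting a line that is repeated within file1 but never appears in file2.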
