Counting equal lines in two files

Let's say I have two files and want to know how many lines they have in common. For example, file1 contains

 1
 3
 2
 4
 5
 0
 10

and file2 contains

 3
 10
 5
 64
 15

In this case, the answer should be 3 (common lines: "3", "10" and "5").

This is, of course, easy to do in Python, for example, but I was curious how to do it from bash (with standard utilities, or extras such as awk). Here is what I came up with:

  cat file1 file2 | sort | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l 

This seems too complicated for the task, so I wonder if there is an easier or more elegant way to achieve the same result.

PS: Printing the number of common lines as a fraction of the total number of lines in each file would also be nice, although it is not necessary.

UPD: Files do not have duplicate lines

7 answers

To find lines common to your 2 files using awk:

 awk 'a[$0]++' file1 file2 

Output:

 3
 10
 5

Now just pipe this to wc to get the number of common lines:

 awk 'a[$0]++' file1 file2 | wc -l 

Output: 3.

Please note that this solution counts duplicates, i.e. if you have:

 file1 | file2
 1     | 3
 2     | 3
 3     | 3

awk 'a[$0]++' file1 file2 will print 3 3 3, and awk 'a[$0]++' file1 file2 | wc -l will print 3.
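If each common line should be counted only once even when it is repeated, one possible variation (a sketch, not part of the original answer) remembers the lines of the first file and prints a line of the second file only the first time it matches. The file names dup1 and dup2 and the temporary directory are made up for illustration.

```shell
# Hypothetical sample data mirroring the duplicate example above.
tmp=$(mktemp -d)
printf '%s\n' 1 2 3 > "$tmp/dup1"
printf '%s\n' 3 3 3 > "$tmp/dup2"

# First file: remember every line as an array key.
# Second file: print a line only if it was remembered and not printed before.
distinct_common=$(awk 'NR==FNR { a[$0]; next } ($0 in a) && !seen[$0]++' \
    "$tmp/dup1" "$tmp/dup2" | wc -l)

echo "$distinct_common"
rm -rf "$tmp"
```

Here only the single distinct line 3 is counted, where a[$0]++ would have counted it three times.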


With your example input, this works too, but if the files are huge, I prefer the awk solutions to the others:

 grep -cFwf file2 file1 

With your input files, the above outputs:

 3 
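One caveat worth checking: -w only requires a pattern to match a whole word somewhere in the line, while -x requires it to match the entire line. For single-word lines like those in the question the two agree, but for multi-word lines -x may be closer to "equal lines". A small sketch with made-up files left and right:

```shell
# Hypothetical sample data: one line of "left" has more than one word.
tmp=$(mktemp -d)
printf '%s\n' '10 apples' '3' '5' > "$tmp/left"
printf '%s\n' '10' '3' '7' > "$tmp/right"

# -w: "10" matches the line "10 apples" because it is a whole word in it.
word_matches=$(grep -cFwf "$tmp/right" "$tmp/left")

# -x: a pattern must match the whole line, so "10 apples" is not counted.
line_matches=$(grep -cFxf "$tmp/right" "$tmp/left")

echo "$word_matches $line_matches"
rm -rf "$tmp"
```

With this data the -w count is 2 and the -x count is 1.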

Here is one without awk, which uses comm instead:

 comm -12 <(sort file1.txt) <(sort file2.txt) | wc -l 

comm compares two sorted files. The options -1 and -2 suppress the lines that are unique to the first and to the second file, so the output is the lines the files have in common, one per line. wc -l counts the number of lines.

Output without wc -l :

 10
 3
 5

And when counting (obviously):

 3 
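The <( ... ) process substitution is a bash feature (also available in ksh and zsh). In a plain POSIX sh, the same idea can be sketched with temporary files. Forcing LC_ALL=C for both sort and comm is my own precaution so that both tools agree on the ordering; the temporary file names are made up.

```shell
# Sample data from the question, written to hypothetical temp files.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 > "$tmp/file1.txt"
printf '%s\n' 3 10 5 64 15 > "$tmp/file2.txt"

# Sort both inputs with the same collation that comm will use.
LC_ALL=C sort "$tmp/file1.txt" > "$tmp/sorted1"
LC_ALL=C sort "$tmp/file2.txt" > "$tmp/sorted2"

# -12 keeps only column 3: the lines present in both files.
common_count=$(LC_ALL=C comm -12 "$tmp/sorted1" "$tmp/sorted2" | wc -l)

echo "$common_count"
rm -rf "$tmp"
```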

You can do everything with awk:

 awk '{ a[$0] += 1} END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print c}' file1 file2 

To get the percentage, something like this works:

 awk '{ a[$0] += 1; if (NR == FNR) { b = FILENAME; n = NR} } END { c = 0; for ( i in a ) { if ( a[i] > 1 ) c++; } print b, c/n; print FILENAME, c/FNR;}' file1 file2 

and displays

 file1 0.428571
 file2 0.6
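These fractions are 3/7 and 3/5: three common lines out of seven lines in file1 and five in file2. To print them as percentages, a variation along these lines might do. It is a sketch that assumes both files are non-empty and contain no duplicate lines (as stated in the question); the temporary directory is only there to make the example self-contained.

```shell
# Sample data from the question in a hypothetical temp directory.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 > "$tmp/file1"
printf '%s\n' 3 10 5 64 15 > "$tmp/file2"

# n1, n2: line counts of the two files; c: lines of file2 also seen in file1.
report=$(cd "$tmp" && awk '
    NR == FNR { a[$0]; n1++; next }
    { n2++; if ($0 in a) c++ }
    END {
        printf "%s %.1f%%\n", ARGV[1], 100 * c / n1
        printf "%s %.1f%%\n", ARGV[2], 100 * c / n2
    }' file1 file2)

printf '%s\n' "$report"
rm -rf "$tmp"
```

For the sample files this prints file1 42.9% and file2 60.0%.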

In your solution, you can get rid of the cat:

 sort file1 file2 | uniq -c | awk '{if ($1 > 1) {$1=""; print $0}}' | wc -l 

You can also use the comm command. Remember that you will have to sort the files you need to compare first:

 [ gc@slave ~]$ sort a > sorted_1
 [ gc@slave ~]$ sort b > sorted_2
 [ gc@slave ~]$ comm -1 -2 sorted_1 sorted_2
 10
 3
 5

From the man page for comm: "comm - compare two sorted files line by line". Options:

 -1 suppress column 1 (lines unique to FILE1)
 -2 suppress column 2 (lines unique to FILE2)
 -3 suppress column 3 (lines that appear in both files)
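Combining these options selects any one of the three columns. As a rough illustration with the sample data from the question (the temporary directory is an assumption of this sketch):

```shell
# Sample data from the question, sorted into hypothetical temp files.
tmp=$(mktemp -d)
printf '%s\n' 1 3 2 4 5 0 10 | sort > "$tmp/sorted_1"
printf '%s\n' 3 10 5 64 15 | sort > "$tmp/sorted_2"

# -23 leaves column 1: lines found only in the first file.
only_first=$(comm -23 "$tmp/sorted_1" "$tmp/sorted_2" | wc -l)
# -13 leaves column 2: lines found only in the second file.
only_second=$(comm -13 "$tmp/sorted_1" "$tmp/sorted_2" | wc -l)
# -12 leaves column 3: lines found in both files.
in_both=$(comm -12 "$tmp/sorted_1" "$tmp/sorted_2" | wc -l)

echo "$only_first $only_second $in_both"
rm -rf "$tmp"
```

For the sample files the three counts are 4, 2 and 3.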

How about keeping it beautiful and simple ...

This is all you need:

 cat file1 file2 | sort -n | uniq -d | wc -l
 3

man sort: -n, --numeric-sort - compare according to string numerical value

man uniq: -d, --repeated - only print duplicate lines

man wc: -l, --lines - print the newline counts

Hope this helps.

EDIT - one process fewer (credit: mart):

 sort file1 file2 | uniq -d | wc -l 
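This relies on neither file containing duplicate lines (which the question guarantees): a line repeated inside one file would be reported by uniq -d even if the other file does not have it. If that guarantee does not hold, a possible workaround is to deduplicate each file first. A sketch with made-up files dup1 and dup2:

```shell
# Hypothetical sample data: "7" is repeated in dup1 and absent from dup2.
tmp=$(mktemp -d)
printf '%s\n' 7 7 8 > "$tmp/dup1"
printf '%s\n' 9 > "$tmp/dup2"

# Plain pipeline: the repeated "7" is wrongly reported as common.
naive_count=$(sort "$tmp/dup1" "$tmp/dup2" | uniq -d | wc -l)

# Deduplicate each file on its own before merging the two.
safe_count=$({ sort -u "$tmp/dup1"; sort -u "$tmp/dup2"; } | sort | uniq -d | wc -l)

echo "$naive_count $safe_count"
rm -rf "$tmp"
```

Here the plain pipeline counts 1 common line while the deduplicated one counts 0.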

One way, using awk:

 awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' file1 file2 

Output:

 3 
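One detail to be aware of: when the files share no lines, n is never assigned and print n emits an empty line rather than 0. Printing n+0 forces a numeric result. A small sketch with made-up files left and right that have nothing in common:

```shell
# Hypothetical sample data with no common lines.
tmp=$(mktemp -d)
printf '%s\n' 1 2 > "$tmp/left"
printf '%s\n' 3 4 > "$tmp/right"

# Original form: n stays uninitialized, so an empty string is printed.
plain=$(awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n}' "$tmp/left" "$tmp/right")

# n+0 converts the uninitialized value to the number 0.
numeric=$(awk 'NR==FNR{a[$0]; next}$0 in a{n++}END{print n+0}' "$tmp/left" "$tmp/right")

echo "plain='$plain' numeric='$numeric'"
rm -rf "$tmp"
```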

Source: https://habr.com/ru/post/973754/

