How to find which line from the first file most often appears in the second file?

I have two lists, and I need to determine which word from the first list appears most often in the second. The first file, list1.txt, contains words sorted alphabetically, without duplicates. I have already run some scripts that guarantee each word is on its own line, for example:

 canyon
 fish
 forest
 mountain
 river

The second file, list2.txt, is UTF-8 encoded and contains many more items. I also used some scripts to ensure that each item is on its own line, but some items are not words, and some appear many times, for example:

 fish
 canyon
 ocean
 ocean
 ocean
 ocean
 1423
 fish
 109
 fish
 109
 109
 ocean
  • The script should output the most frequently matched element. For example, when run with the two files above, the output should be "fish", because that word from list1.txt is found most often in list2.txt.

Here is what I have so far. It searches for every word and writes a CSV file with the match counts:

 #!/bin/bash
 # For every word in list1.txt, count the lines in list2.txt that start with it
 # and append "word,count" to found.csv.
 while read -r line
 do
     count=$(grep -c "^$line" list2.txt)
     echo "$line,$count" >> found.csv
 done < ./list1.txt

After that, found.csv is sorted in descending order by the second column, and the answer is the word on the first line (a sketch of that step is shown after the list below). I don't think this is a good script: it is not very efficient, and there may not be a single most frequent match, for example:

  • If there is a tie between two or more words, for example "fish", "canyon" and "forest" each appear 5 times while the others appear less often, those 3 words should be output in alphabetical order, separated by commas, for example: "canyon, fish, forest".
  • If none of the words from list1.txt appears in list2.txt, then the output should simply be the first word from list1.txt, for example "canyon".
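
Roughly, that post-processing step on found.csv looks like the following (a sketch: only its behaviour is described above, so the exact invocation is an assumption, and it does not handle the tie or no-match cases either):

 sort -t, -k2,2nr found.csv | head -n 1 | cut -d, -f1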

How can I create a more efficient script that finds which word from the first list appears most often in the second?

3 answers

You can use the following pipeline:

 grep -Ff list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1 

-F tells grep to treat the patterns as literal (fixed) strings, and -f tells it to read the list of patterns from list1.txt. The rest sorts the matches, counts the duplicates, and sorts them by number of occurrences. The last part selects the last row, i.e. the most frequent element (together with its count).
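
One detail: without anchoring, grep also counts lines that merely contain a word from list1.txt as a substring (a line "fishes" would count towards "fish"). If only exact whole-line matches should count, grep's -x option restricts matching to whole lines; a possible variant:

 grep -Fxf list1.txt list2.txt | sort | uniq -c | sort -n | tail -n1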

 awk 'FNR==NR{a[$1]=0;next} ($1 in a){a[$1]++} END{for(i in a)print a[i],i}' file1 file2 | sort -rn | head -1
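
The same program spread out with comments for readability (a sketch, with the question's file names substituted for file1 and file2; behaviour is unchanged, so it still prints the count before the word and does not implement the tie or no-match rules from the question):

 awk '
     FNR == NR { a[$1] = 0; next }         # first file: register every word with count 0
     ($1 in a) { a[$1]++ }                 # second file: count only words present in the first
     END { for (i in a) print a[i], i }    # emit "count word" pairs
 ' list1.txt list2.txt | sort -rn | head -1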

Assuming list1.txt is sorted, I would use unix join:

 sort list2.txt | join -1 1 -2 1 list1.txt - | sort | uniq -c | sort -n | tail -n1
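
Note that join requires both of its inputs to be sorted, which is why list2.txt is piped through sort first (and why list1.txt has to be sorted already). With the sample files from the question, the result should look roughly like this (the leading count comes from uniq -c; exact spacing may vary):

 3 fish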
