I have two lists, and I need to determine which word from the first list appears most often in the second. The first file, list1.txt, contains a list of words sorted alphabetically, without duplicates. I used some scripts to guarantee that each word is on its own line, for example:
canyon
fish
forest
mountain
river
The second file, list2.txt, is UTF-8 encoded and also contains many entries. I likewise used some scripts to ensure that each entry is on its own line, but some entries are not words, and some can appear many times, for example:
fish
canyon
ocean
ocean
ocean
ocean
1423
fish
109
fish
109
109
ocean
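The preprocessing itself is not part of the question; something like the following would produce files in that layout (a sketch only, assuming whitespace-separated raw input, GNU tr, and hypothetical input names raw_list1.txt and raw_list2.txt):

# Sketch: split whitespace-separated input into one token per line.
# raw_list1.txt and raw_list2.txt are hypothetical names for the unprocessed input.
tr -s '[:space:]' '\n' < raw_list1.txt | sort -u > list1.txt   # sorted, no duplicates
tr -s '[:space:]' '\n' < raw_list2.txt > list2.txt             # duplicates kept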
- The script should output the most frequently matched element. For example, if run with the two files above, the output should be "fish", because that word from list1.txt is found most often in list2.txt.
Here is what I have so far. First, it looks up every word and writes a CSV file with its match count:
#!/bin/bash
# For each word in list1.txt, count the lines in list2.txt that start with it
# and append "word,count" to found.csv.
while read -r line
do
    count=$(grep -c "^$line" list2.txt)
    echo "$line,$count" >> found.csv
done < ./list1.txt
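I have not shown the sorting step; it is roughly the following (a sketch assuming GNU sort and the word,count layout written above):

# Sketch: sort by the count column (descending) and take the word on the first line.
sort -t, -k2,2nr found.csv | head -n 1 | cut -d, -f1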
After that, found.csv is sorted in descending order by the second column, and the result is the word on the first line. I do not think this is a good script: it is not very efficient, and it does not handle the cases where there is no single most frequently matched element, for example:
- If there is a tie between two or more words, for example "fish", "canyon" and "forest" each appear 5 times while the others appear less often, the script should output these 3 words in alphabetical order, separated by commas, for example: "canyon, fish, forest".
- If none of the words from list1.txt appears in list2.txt, the output should simply be the first word from list1.txt, for example "canyon" (see the sketch below).
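A minimal sketch of those two rules applied to found.csv, assuming GNU sort and awk (only an illustration of the desired output, not an efficient solution):

# Sketch: take every word that shares the maximum count (alphabetical, comma-separated);
# if the maximum count is zero, fall back to the first word of list1.txt.
best=$(sort -t, -k2,2nr -k1,1 found.csv | awk -F, '
    NR == 1 { max = $2 }                                  # highest count is on the first line
    $2 == max && max > 0 { out = out (out ? ", " : "") $1 }
    END { print out }
')
if [ -z "$best" ]; then
    best=$(head -n 1 list1.txt)
fi
echo "$best"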
How can I write a more efficient script that finds which word from the first list appears most often in the second?