I wrote a program that reads the words of a text file, in my case "elquijote.txt", and uses a dictionary {key: value} to show each word that appears and its number of occurrences.
For example, for the file test1.txt with the following words:
hello hello hello good bye bye
The output of my program:
hello 3 good 1 bye 2
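In other words, the counting step behaves roughly like this minimal sketch (`count_words` is a name made up for illustration, it is not a function in my program):

```python
def count_words(text):
    """Return a dict mapping each lowercased word to its number of occurrences."""
    counts = {}
    for word in text.lower().split():
        # dict.get returns 0 the first time a word is seen
        counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_words("hello hello hello good bye bye")
# counts == {'hello': 3, 'good': 1, 'bye': 2}
```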
Another option of the program is to show only those words that appear more times than a number passed as an argument.
If we run the command "python readwords.py test1.txt 2" in the shell, it shows only the words in the test1.txt file that appear more times than the given number, in this case 2.
Output:
hello 3
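The threshold check can be sketched like this, assuming the counts are already in a dictionary (`filter_counts` is a hypothetical helper name used only for this example):

```python
def filter_counts(counts, minimum):
    """Keep only the words that occur strictly more than `minimum` times."""
    return {word: n for word, n in counts.items() if n > minimum}


counts = {"hello": 3, "good": 1, "bye": 2}
frequent = filter_counts(counts, 2)
# frequent == {'hello': 3}
```

Note that the comparison is strict, so "bye" with exactly 2 occurrences is not shown.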
A third argument can also be passed: a file of common words, such as conjunctions, which are so frequent that we do not want to show them or store them in the dictionary.
My code works correctly; the problem is that with huge files like "elquijote.txt" it takes a long time to finish.
I think this is because of the way I use my helper lists to remove words.
I thought the solution would be to never put into my lists the words that appear in the txt file passed as the argument containing the words to reject.
Here is my code:
    import sys


    def contar(aux):
        # Count how many times each word occurs (case-insensitive).
        counts = {}
        for palabra in aux:
            palabra = palabra.lower()
            if palabra not in counts:
                counts[palabra] = 0
            counts[palabra] += 1
        return counts


    def main():
        characters = '!?¿-.:;-,><=*»¡'
        aux = []
        counts = {}
        # Read the input file, dropping the unwanted characters.
        with open(sys.argv[1],'r') as f:
            aux = ''.join(c for c in f.read() if c not in characters)
        aux = aux.split()
        if (len(sys.argv)>3):
            # Load the forbidden words from the file given as third argument.
            with open(sys.argv[3], 'r') as f:
                remove = "".join(c for c in f.read())
            remove = remove.split()
            # Remove the forbidden words from the list
            for word in aux:
                if word in remove:
                    aux.remove(word)
        counts = contar(aux)
        # Print only the words that occur more than the given number of times.
        for word, count in counts.items():
            if count > int(sys.argv[2]):
                print word, count


    if __name__ == '__main__':
        main()
The contar function counts the occurrences of each word in a dictionary.
The main function fills the "aux" list with the words of the file after stripping the punctuation characters, and then removes from that same list the "forbidden" words loaded from another .txt file.
I think the right approach would be to discard the forbidden words at the same point where I discard the unaccepted characters, but after several attempts I have not been able to do it correctly.
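Roughly, what I am trying to get to is something like the sketch below: keep the forbidden words in a set and skip them while counting, instead of removing them from the list afterwards. The names `count_allowed`, `text` and `forbidden` are made up for illustration, and I have not timed this against "elquijote.txt". It also lowercases words before comparing them with the forbidden ones, which my current code does not do.

```python
CHARACTERS = '!?¿-.:;-,><=*»¡'


def count_allowed(text, forbidden):
    """Count lowercased words in `text`, skipping any word in `forbidden`."""
    # A set gives average constant-time membership tests,
    # unlike a list, which is scanned element by element.
    forbidden = set(w.lower() for w in forbidden)
    cleaned = ''.join(c for c in text if c not in CHARACTERS)
    counts = {}
    for word in cleaned.lower().split():
        if word in forbidden:
            continue  # never enters the dictionary
        counts[word] = counts.get(word, 0) + 1
    return counts


counts = count_allowed("Hello, hello and bye. And bye!", ["and"])
# counts == {'hello': 2, 'bye': 2}
```

This way there is no `aux.remove(word)` call at all, so the word list is never modified while it is being traversed.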
Here you can check my code online: https://repl.it/Nf3S/54 Thank you.