Reading words from a txt file - Python

I wrote a program that reads the words of a txt file, in my case "elquijote.txt", and then uses a dictionary {key: value} to show the words that appear and their number of occurrences.

For example, for the file test1.txt with the following words:

hello hello hello good bye bye 

The output of my program:

    hello 3
    good 1
    bye 2

Another option of the program is to show only those words that appear more times than a number passed as an argument.

If we run the command "python readwords.py test1.txt 2" in the shell, it shows the words in the test1.txt file that appear more times than the number we entered, in this case 2.

Output:

 hello 3 

A third argument can name a file of common words, such as conjunctions, which are so frequent that we do not want to show them or add them to our dictionary.

My code works correctly. The problem is that with huge files like "elquijote.txt" it takes a long time to finish.

I think this is because I use auxiliary lists to eliminate words.

I thought the solution would be to never add to my lists the words that appear in the txt file passed as the argument containing the words to reject.

Here is my code:

    import sys

    def contar(aux):
        counts = {}
        for palabra in aux:
            palabra = palabra.lower()
            if palabra not in counts:
                counts[palabra] = 0
            counts[palabra] += 1
        return counts

    def main():
        characters = '!?¿-.:;-,><=*»¡'
        aux = []
        counts = {}

        with open(sys.argv[1],'r') as f:
            aux = ''.join(c for c in f.read() if c not in characters)
            aux = aux.split()

        if (len(sys.argv)>3):
            with open(sys.argv[3], 'r') as f:
                remove = "".join(c for c in f.read())
                remove = remove.split()
                # Delete from the file
                for word in aux:
                    if word in remove:
                        aux.remove(word)

        counts = contar(aux)

        for word, count in counts.items():
            if count > int(sys.argv[2]):
                print word, count

    if __name__ == '__main__':
        main()

The contar function counts the words in a dictionary.

The main function stores in the "aux" list the words with the symbol characters stripped out, and then removes from that same list the "forbidden" words loaded from another .txt file.

I think the right solution would be to discard the forbidden words in the same step in which I discard the unaccepted characters, but after several attempts I was not able to do it correctly.

Here you can check my code online: https://repl.it/Nf3S/54 Thank you.

3 answers

There are several inefficiencies here. I rewrote your code to address some of them. The rationale for each change is given in the comments and docstrings:

    # -*- coding: utf-8 -*-
    import sys
    from collections import Counter


    def contar(aux):
        """Here I replaced your hand-made solution with the built-in Counter,
        which is quite a bit faster. There is no real reason to keep this
        function; I left it to keep your code interface intact.
        """
        return Counter(aux)


    def replace_special_chars(string, chars, replace_char=" "):
        """Replaces a set of characters by another character, a space by default
        """
        for c in chars:
            string = string.replace(c, replace_char)
        return string


    def main():
        characters = '!?¿-.:;-,><=*»¡'
        aux = []
        counts = {}

        with open(sys.argv[1], "r") as f:
            # You were calling lower() once for every `word`. Now we only
            # call it once for the whole file:
            contents = f.read().strip().lower()
            contents = replace_special_chars(contents, characters)
            aux = contents.split()

        # Delete from the file
        if len(sys.argv) > 3:
            with open(sys.argv[3], "r") as f:
                # What you had here was very inefficient:
                #     remove = "".join(c for c in f.read())
                # That would create a sequence of characters and then join
                # them back together as a string. This is a bit silly
                # because it is identical to f.read():
                #     "".join(c for c in f.read()) == f.read()
                # ignore_words is a `set` to allow for very fast
                # inclusion/exclusion checks.
                ignore_words = set(f.read().strip().split())
                aux = (word for word in aux if word not in ignore_words)

        counts = contar(aux)

        for word, count in counts.items():
            if count > int(sys.argv[2]):
                print word, count


    if __name__ == '__main__':
        main()
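To see why the set-based filter matters, here is a minimal Python 3 sketch. The helper names `remove_in_place` and `filter_with_set` and the sample data are hypothetical, not part of the code above. Besides costing a linear scan per word, calling `list.remove` on the list being iterated makes the loop skip the element that slides into the freed slot, so some forbidden words can survive.

```python
def remove_in_place(words, ignore):
    """Mimics the loop from the question: mutate the list while iterating it."""
    words = list(words)  # work on a copy so the caller's list is untouched
    for word in words:
        if word in ignore:
            words.remove(word)  # linear scan, and shifts later items left
    return words


def filter_with_set(words, ignore):
    """Single pass with O(1) average membership tests against a set."""
    ignore = set(ignore)
    return [word for word in words if word not in ignore]


sample = ["the", "the", "cat", "and", "the", "dog"]
stop_words = ["the", "and"]

leaky = remove_in_place(sample, stop_words)
clean = filter_with_set(sample, stop_words)
```

In this sample `leaky` still contains "the", while `clean` holds only "cat" and "dog".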

Here are a couple of optimizations:

  • Use collections.Counter() to count items in contar()
  • Use str.translate() to remove unwanted characters
  • Pop the ignored words from the counts after counting, instead of removing them from the source text.

The speedup is modest, not an order of magnitude.

    #!/usr/bin/python
    # -*- coding: utf-8 -*-
    import sys
    import os
    import collections

    def contar(aux):
        return collections.Counter(aux)

    def main():
        characters = '!?¿-.:;-,><=*»¡'
        aux = []
        counts = {}

        with open(sys.argv[1],'r') as f:
            # Python 2 str.translate(None, deletechars): deletes the
            # listed bytes from the text
            text = f.read().lower().translate(None, characters)
            aux = text.split()

        if (len(sys.argv)>3):
            with open(sys.argv[3], 'r') as f:
                remove = set(f.read().strip().split())
        else:
            remove = []

        counts = contar(aux)
        for r in remove:
            counts.pop(r, None)

        for word, count in counts.items():
            if count > int(sys.argv[2]):
                print word, count

    if __name__ == '__main__':
        main()
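The two-argument `translate(None, characters)` call exists only on Python 2 byte strings. A rough Python 3 sketch of the same idea, assuming the text is already a `str`, builds a deletion table with `str.maketrans` and pops the ignored words from the `Counter` after counting. The function name `count_words` and the sample sentence are illustrative only.

```python
from collections import Counter

CHARACTERS = '!?¿-.:;-,><=*»¡'
# The third argument of str.maketrans lists the characters to delete.
DELETE_TABLE = str.maketrans('', '', CHARACTERS)


def count_words(text, ignore=()):
    """Count lowercase words in text, then drop the ignored ones."""
    words = text.lower().translate(DELETE_TABLE).split()
    counts = Counter(words)
    for word in set(ignore):
        counts.pop(word, None)  # no error if the word never appeared
    return counts


counts = count_words("¡Hello, hello! Good-bye; and hello.", ignore=["and"])
```

Because the characters are deleted rather than replaced with a space, "Good-bye" is counted as "goodbye" here. Replacing them with spaces, as in the previous answer, would split it into two words instead.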

A few changes and reasoning:

  • Handle the command-line arguments in the if __name__ == '__main__': section. Doing so makes the code more modular, because the arguments are only read when the script is run directly, not when another script imports a function from it.
  • Use a regex to filter out words with characters you don’t want. A regex lets you state either which characters you want or which characters you do NOT want, whichever list is shorter. Here, hard-coding every unwanted special character is tedious compared to declaring the wanted characters in a simple pattern. In the following script, I filter out words that are not alphanumeric using the pattern [aA-zZ0-9]+ . Note that the A-z range in this class also covers the ASCII punctuation between Z and a ([ \ ] ^ _ `), so [A-Za-z0-9]+ is the strictly alphanumeric form.
  • Ask forgiveness, not permission. Since the minimum-count command-line argument is optional, it will not always be present. We can therefore be Pythonic and use a try/except block: try to set the minimum count from sys.argv[2], and catch the IndexError to default it to 0.
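The "ask forgiveness, not permission" point can be sketched in isolation. This is a minimal illustration, assuming an argument list laid out like `sys.argv`; the helper name `parse_min_count` is hypothetical.

```python
def parse_min_count(argv, default=0):
    """Return int(argv[2]) if present, otherwise the default.

    Only IndexError (argument missing) is caught; a non-numeric value
    still raises ValueError, so that mistake is not silently hidden.
    """
    try:
        return int(argv[2])
    except IndexError:
        return default


with_arg = parse_min_count(["script.py", "text.txt", "2"])
without_arg = parse_min_count(["script.py", "text.txt"])
```

Catching only the specific exception keeps the default behaviour narrow: a missing argument falls back to 0, while any other problem still surfaces.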

Python script:

    # sys
    import sys
    # regex
    import re


    def main(text_file, min_count):
        word_count = {}

        with open(text_file, 'r') as words:
            # Clean words of linebreaks and split
            # by ' ' to get list of words
            words = words.read().strip().split(' ')

            # Filter words that are not alphanum
            # (the A-z range also spans [ \ ] ^ _ ` in ASCII;
            # [A-Za-z0-9] is strictly alphanumeric)
            pattern = re.compile(r'^[aA-zZ0-9]+$')
            words = filter(pattern.search, words)

            # Iterate through words and collect
            # count
            for word in words:
                if word in word_count:
                    word_count[word] = word_count[word] + 1
                else:
                    word_count[word] = 1

        # Iterate for output
        for word, count in word_count.items():
            if count > min_count:
                print('%s %s' % (word, count))


    if __name__ == '__main__':
        # Get text file name
        text_file = sys.argv[1]

        # Attempt to get minimum count
        # from command line.
        # Default to 0
        try:
            min_count = int(sys.argv[2])
        except IndexError:
            min_count = 0

        main(text_file, min_count)
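One caveat for a Spanish text such as "elquijote.txt": an ASCII-only character class rejects accented words, and filtering whole tokens drops a word with attached punctuation entirely instead of stripping the punctuation. A hedged alternative, assuming Python 3 where `\w` matches Unicode word characters in `str` patterns, is to extract the word runs with `re.findall`. The name `extract_words` and the sample sentence are illustrative only; the class `[^\W_]` is used because `\w` alone would also match the underscore.

```python
import re
from collections import Counter

# [^\W_]+ : one or more word characters, excluding the underscore.
WORD_PATTERN = re.compile(r'[^\W_]+')


def extract_words(text):
    """Return lowercase alphanumeric runs, including accented letters."""
    return WORD_PATTERN.findall(text.lower())


words = extract_words("¡Hola, corazón! hola... adiós")
counts = Counter(words)
```

Here the punctuation is discarded while "corazón" and "adiós" are kept intact, and "hola" is counted twice.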

Text file:

 hello hello hello good bye goodbye !bye bye¶ b?e goodbye 

Command:

 python script.py text.txt 

Output:

    bye 1
    good 1
    hello 3
    goodbye 2

With a minimum count command:

 python script.py text.txt 2 

Output:

 hello 3 

Source: https://habr.com/ru/post/1273105/

