Counting unique words in a document using Python

I am new to Python and am trying to understand an answer to the question of how to count the unique words in a document. The answer is:

print len(set(w.lower() for w in open('filename.dat').read().split())) 

It reads the entire file into memory, splits it into words on whitespace, converts each word to lower case, builds a set of the (unique) lowercase words, counts them, and prints the result.
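Broken into separate steps, the one-liner can be sketched roughly as follows. This is a minimal sketch that works on an in-memory string instead of a file so that it is self-contained; the function name count_distinct_words and the sample text are illustrative assumptions, and it uses the Python 3 print() function where the quoted answer uses the Python 2 print statement.

```python
def count_distinct_words(text):
    """Return the number of distinct whitespace-separated words, ignoring case."""
    words = text.split()                   # split on any run of whitespace
    lowered = [w.lower() for w in words]   # normalize case
    distinct = set(lowered)                # a set keeps one copy of each word
    return len(distinct)                   # len() works on sets as well as lists


sample = "The cat and the dog and THE bird"
print(count_distinct_words(sample))  # prints 5
```

For a real file, the string could come from open(...).read(), as in the quoted answer.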

To understand this, I am trying to implement it in Python step by step. I can read in a text file using open and read, divide it into separate words using split, and make them all lowercase using lower. I can also create a set of the unique words in the list. However, I can’t work out how to do the last part: counting the number of unique words.

I thought I could finish by iterating over the elements in the set of unique words and counting them in the original lowercase list, but I find that the set construct is not indexable.

So I think what I'm trying to do is, in natural language, something like: for every element in the set, tell me how many times it occurs in the lowercase list. But I can’t figure out how to do this, and I suspect that some misunderstanding of Python is holding me back.
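The idea described here can be written almost literally: a set cannot be indexed, but it can be iterated over with a for loop, and list.count reports how often a value occurs in a list. A minimal sketch, assuming the text is already in a string (the name count_each_word and the sample text are hypothetical):

```python
def count_each_word(text):
    """Map each distinct lowercase word to its number of occurrences."""
    lowered = [w.lower() for w in text.split()]
    counts = {}
    for word in set(lowered):               # iterate over the set, no indexing needed
        counts[word] = lowered.count(word)  # count occurrences in the full list
    return counts


counts = count_each_word("It is what it is")
for word in sorted(counts):
    print(word, counts[word])
```

Note that list.count rescans the whole list for every distinct word, so this can be slow on large documents; a single pass with a dictionary avoids that.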

EDIT:

Thanks for the answers, guys. I just realized that I did not explain myself correctly: I wanted to find not only the total number of unique words (which, as I understand it, is the length of the set), but also the number of times each individual word was used, for example, "the" was used 14 times, "and" was used 9 times, "it" was used 20 times, and so on. Apologies for the confusion.

+6
6 answers

I believe Counter is all you need in this case:

    from collections import Counter
    print Counter(yourtext.split())
+11

You can count the number of elements in a set, list, or tuple in the same way, using len(my_set) or len(my_list).

Edit: counting the number of times each word is used is a different problem.
Here's an obvious approach:

    count = {}
    for w in open('filename.dat').read().split():
        if w in count:
            count[w] += 1
        else:
            count[w] = 1
    for word, times in count.items():
        print "%s was found %d times" % (word, times)

If you want to avoid the if clause, you can look at collections.defaultdict.

+6

A set by definition contains only unique elements (in your case, the same lowercased string cannot appear in it twice). So you just need the number of elements in the set, which is its length: len(set(...)).

+4

Your question already contains the answer. If s is the set of unique words in a document, then len(s) gives the number of elements in the set, that is, the number of unique words in the document.

+1

You can use Counter:

    from collections import Counter
    c = Counter(['mama','papa','mama'])

The resulting c will be:

 Counter({'mama': 2, 'papa': 1}) 
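Applied to the question, a Counter built from the lowercased words gives both numbers the asker wants. A minimal sketch, assuming the text is in a string and written for Python 3 (the name word_frequencies and the sample text are hypothetical):

```python
from collections import Counter


def word_frequencies(text):
    """Count how often each lowercase word occurs in text."""
    return Counter(w.lower() for w in text.split())


freq = word_frequencies("the cat and the dog and the bird")
print(len(freq))            # number of distinct words
print(freq["the"])          # how many times one particular word was used
print(freq.most_common(2))  # the two most frequent words with their counts
```

Counter is a dict subclass, so len(), indexing, and items() behave as they do for an ordinary dictionary.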
+1

I would say that this code counts the number of distinct words, not the number of unique words, which would be the number of words that occur exactly once.

Here, the number of times each word occurs is counted:

    from collections import defaultdict

    word_counts = defaultdict(int)
    for w in open('filename.dat').read().split():
        word_counts[w.lower()] += 1
    for w, c in word_counts.iteritems():
        print w, "occurs", c, "times"
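The distinction this answer draws, between distinct words and words that occur exactly once, can be illustrated with a short sketch. It assumes the text is in a string and uses Python 3 with Counter in place of defaultdict; the name distinct_and_once and the sample text are hypothetical.

```python
from collections import Counter


def distinct_and_once(text):
    """Return (number of distinct words, number of words occurring exactly once)."""
    counts = Counter(w.lower() for w in text.split())
    distinct = len(counts)
    once = sum(1 for c in counts.values() if c == 1)
    return distinct, once


print(distinct_and_once("the cat and the dog and the bird"))  # prints (5, 3)
```

In the sample, "the" and "and" are repeated, so five distinct words remain but only three of them occur once.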
0

Source: https://habr.com/ru/post/889885/