How to find the most common words in several separate texts?

A simple question, but I can’t crack it. I have a string that is formatted as follows:

["category1",("data","data","data")] ["category2", ("data","data","data")] 

I call this list of categories posts, and I want to get the most frequently used words from the data sections. So I tried:

 from nltk.tokenize import wordpunct_tokenize
 from collections import defaultdict

 freq_dict = defaultdict(int)

 for cat, text2 in posts:
     tokens = wordpunct_tokenize(text2)
     for token in tokens:
         if token in freq_dict:
             freq_dict[token] += 1
         else:
             freq_dict[token] = 1
     top = sorted(freq_dict, key=freq_dict.get, reverse=True)
     top = top[:50]
     print top

However, this gives me the top words per post in the string.

I need one overall list of the top words across all posts instead.
However, if I take the print of top out of the for loop, it only gives me the results of the last post.
Does anyone have an idea?

+4
4 answers

This is a scoping problem. In addition, you don’t need to initialize defaultdict entries yourself, which simplifies your code.

Try the following:

 posts = [["category1",("data1 data2 data3")],["category2", ("data1 data3 data5")]] from nltk.tokenize import wordpunct_tokenize from collections import defaultdict freq_dict = defaultdict(int) for cat, text2 in posts: tokens = wordpunct_tokenize(text2) for token in tokens: freq_dict[token] += 1 top = sorted(freq_dict, key=freq_dict.get, reverse=True) top = top[:50] print top 

This, as expected, outputs:

 ['data1', 'data3', 'data5', 'data2'] 


If you really have something like

 posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]] 

you don’t need wordpunct_tokenize(), since the input is already tokenized. Then the following will work:

 posts = [["category1",("data1","data2","data3")],["category2", ("data1","data3","data5")]] from collections import defaultdict freq_dict = defaultdict(int) for cat, tokens in posts: for token in tokens: freq_dict[token] += 1 top = sorted(freq_dict, key=freq_dict.get, reverse=True) top = top[:50] print top 

and it also outputs the expected result:

 ['data1', 'data3', 'data5', 'data2'] 
+3

Why not just use Counter?

 In [30]: from collections import Counter

 In [31]: data = ["category1", ("data", "data", "data")]

 In [32]: Counter(data[1])
 Out[32]: Counter({'data': 3})

 In [33]: Counter(data[1]).most_common()
 Out[33]: [('data', 3)]
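To extend this to all categories (a sketch assuming posts is a list of [category, data] pairs as in the question), you can update a single Counter in a loop:

 from collections import Counter

 # hypothetical structure based on the question's example
 posts = [["category1", ("data", "data", "data")],
          ["category2", ("data", "other", "data")]]

 total = Counter()
 for cat, data in posts:
     total.update(data)  # update() adds this post's token counts

 print total.most_common(2)
 # -> [('data', 5), ('other', 1)]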
+3
 from itertools import chain
 from collections import Counter
 from nltk.tokenize import wordpunct_tokenize

 texts = ["a quick brown car", "a fast yellow rose",
          "a quick night rider", "a yellow officer"]

 print Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3)

outputs:

 [('a', 4), ('yellow', 2), ('quick', 2)] 

As you can see in the documentation for Counter.most_common, the returned list is sorted from the most common element to the least.

To use this with your code, you can do

 texts = (x[1] for x in posts) 

or you can do

 ... wordpunct_tokenize(x[1]) for x in texts ... 

If your posts really look like this:

 posts=[("category1",["a quick brown car", "a fast yellow rose"]), ("category2",["a quick night rider", "a yellow officer"])] 

You can get rid of the categories:

 texts = list(chain.from_iterable(x[1] for x in posts)) 

(texts will be ['a quick brown car', 'a fast yellow rose', 'a quick night rider', 'a yellow officer'])

You can then use texts in the snippet at the top of this answer, as in the combined sketch below.
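Combining the two steps (same sample data as above; the order of tied counts may vary):

 from itertools import chain
 from collections import Counter
 from nltk.tokenize import wordpunct_tokenize

 posts = [("category1", ["a quick brown car", "a fast yellow rose"]),
          ("category2", ["a quick night rider", "a yellow officer"])]

 # flatten the per-category lists into one sequence of texts
 texts = list(chain.from_iterable(x[1] for x in posts))

 # tokenize every text and count all tokens in one pass
 print Counter(chain.from_iterable(wordpunct_tokenize(x) for x in texts)).most_common(3)
 # -> [('a', 4), ('quick', 2), ('yellow', 2)]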

+2

Just restructure your code so that you process all the posts first and only then get the top words:

 from nltk.tokenize import wordpunct_tokenize
 from collections import defaultdict

 freq_dict = defaultdict(int)

 for cat, text2 in posts:
     tokens = wordpunct_tokenize(text2)
     for token in tokens:
         freq_dict[token] += 1

 # get top after all posts have been processed
 top = sorted(freq_dict, key=freq_dict.get, reverse=True)
 top = top[:50]
 print top
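As a side note, if you also want the counts, collections.Counter can replace the manual sort; the following is equivalent to the sorted(...)[:50] above, except that it returns (word, count) pairs:

 from collections import Counter

 # Counter accepts an existing mapping of counts
 top = Counter(freq_dict).most_common(50)
 print top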
+1
