This is a scoping problem. In addition, you do not need to initialize the defaultdict elements, which simplifies your code. Try the following:
    from collections import defaultdict
    from nltk.tokenize import wordpunct_tokenize

    posts = [["category1", ("data1 data2 data3")], ["category2", ("data1 data3 data5")]]

    # Missing keys start at 0, so no explicit initialization is needed.
    freq_dict = defaultdict(int)
    for cat, text2 in posts:
        tokens = wordpunct_tokenize(text2)
        for token in tokens:
            freq_dict[token] += 1

    # Sort the tokens by count, most frequent first, and keep the top 50.
    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print(top)
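As expected, this prints the following (tokens with equal counts may appear in a different order, depending on your Python version):

    ['data1', 'data3', 'data5', 'data2']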
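The point about not initializing defaultdict elements can be seen in isolation. The following is a small sketch on made-up tokens (the variable names `plain_counts` and `default_counts` are illustrative only), comparing a plain dict with `defaultdict(int)`:

```python
from collections import defaultdict

# Plain dict: a missing key must be initialized before it can be incremented.
plain_counts = {}
for token in ["data1", "data2", "data1"]:
    if token not in plain_counts:
        plain_counts[token] = 0
    plain_counts[token] += 1

# defaultdict(int): a missing key is created on first access with int() == 0.
default_counts = defaultdict(int)
for token in ["data1", "data2", "data1"]:
    default_counts[token] += 1

# Both approaches end up with the same counts.
print(dict(default_counts) == plain_counts)
```

Both loops produce the same counts; the defaultdict version just drops the membership check.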
If your input actually looks like

    posts = [["category1", ("data1", "data2", "data3")], ["category2", ("data1", "data3", "data5")]]

you don't need wordpunct_tokenize() at all, since the input is already tokenized. Then the following will work:
    from collections import defaultdict

    posts = [["category1", ("data1", "data2", "data3")], ["category2", ("data1", "data3", "data5")]]

    freq_dict = defaultdict(int)
    # Each post already holds a tuple of tokens, so iterate over it directly.
    for cat, tokens in posts:
        for token in tokens:
            freq_dict[token] += 1

    top = sorted(freq_dict, key=freq_dict.get, reverse=True)
    top = top[:50]
    print(top)
It also prints the expected result:

    ['data1', 'data3', 'data5', 'data2']
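Tokens with equal counts, such as data1 and data3 here, come out in whatever order the dictionary yields them. If a reproducible order matters, one option is to break ties explicitly. The sketch below assumes alphabetical tie-breaking is acceptable; `top_tokens` is a hypothetical helper name, not something from the code above:

```python
from collections import defaultdict


def top_tokens(posts, n=50):
    """Return up to n tokens, most frequent first; ties are broken alphabetically."""
    freq_dict = defaultdict(int)
    for _cat, tokens in posts:
        for token in tokens:
            freq_dict[token] += 1
    # Sort by descending count, then by the token itself for a stable tie order.
    return sorted(freq_dict, key=lambda tok: (-freq_dict[tok], tok))[:n]


posts = [["category1", ("data1", "data2", "data3")],
         ["category2", ("data1", "data3", "data5")]]
print(top_tokens(posts))  # ['data1', 'data3', 'data2', 'data5']
```

With this sort key the result no longer depends on dictionary ordering, so it is the same on every Python version.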