I am building a simple text-mining system using a Naive Bayes classifier.
To train the classifier, I have a text file in which each line contains a list of tokens (generated from a tweet) preceded by the sentiment associated with them (0 for negative, 4 for positive).
For instance:
0 @ switchfoot http :
Now, what I'm trying to do is, for each token, count how many times it occurs in a positive tweet and how many times it occurs in a negative tweet. I then plan to use these counts to calculate probabilities. I use the built-in dictionary to store the counters: the keys are tokens, and the values are lists of size 2 (one slot for the negative count, one for the positive count).
The problem is that this code starts off fast but keeps slowing down; by the time it has processed about 200 thousand tweets, it is crawling along at roughly 1 tweet per second. Since there are 1.6 million tweets in my training set, this is far too slow. The code I have is:
def compute_counts(infile):
    f = open(infile)
    counts = {}
    i = 0
    for line in f:
        i = i + 1
        print(i)
        words = line.split(' ')
        for word in words[1:]:
            word = word.replace('\n', '').replace('\r', '')
            if words[0] == '0':
                if word in counts.keys():
                    counts[word][0] += 1
                else:
                    counts[word] = [1, 0]
            else:
                if word in counts.keys():
                    counts[word][1] += 1
                else:
                    counts[word] = [0, 1]
    return counts
What can I do to speed this up? Is there a better data structure?
Edit: this is not a duplicate; the question is not about something faster than a dict in the general case, but about this particular use case.