Is there a faster alternative to a dictionary in Python?

I am making a simple sentiment mining system using a Naive Bayes classifier.

To train my classifier, I have a text file in which each line contains a list of tokens (generated from a tweet) and the sentiment associated with them (0 for negative, 4 for positive).

For instance:

    0 @ switchfoot http : //twitpic.com/2y1zl - Awww , that a bummer . You shoulda got David Carr of Third Day to do it . ; D
    0 spring break in plain city ... it snowing
    0 @ alydesigns i was out most of the day so did n't get much done
    0 some1 hacked my account on aim now i have to make a new one
    0 really do n't feel like getting up today ... but got to study to for tomorrows practical exam ...

Now, for each token, I want to count how many times it occurs in a positive tweet and how many times it occurs in a negative tweet. I then plan to use these counts to calculate probabilities. I use the built-in dictionary to store the counts. The keys are tokens, and the values are integer arrays of size 2.

The problem is that this code starts out fairly fast but keeps getting slower, and by the time it has processed around 200 thousand tweets it is very slow, about 1 tweet per second. Since my training set contains 1.6 million tweets, this is too slow. The code I have is:

    def compute_counts(infile):
        f = open(infile)
        counts = {}
        i = 0
        for line in f:
            i = i + 1
            print(i)
            words = line.split(' ')
            for word in words[1:]:
                word = word.replace('\n', '').replace('\r', '')
                if words[0] == '0':
                    if word in counts.keys():
                        counts[word][0] += 1
                    else:
                        counts[word] = [1, 0]
                else:
                    if word in counts.keys():
                        counts[word][1] += 1
                    else:
                        counts[word] = [0, 1]
        return counts

What can I do to speed up this process? Better data structure?

Edit: This is not a duplicate. The question is not about something faster than a dict in the general case, but about this particular use case.

1 answer

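Do not use if word in counts.keys(). If you do that, you end up searching through the keys sequentially, which is exactly what a dict is supposed to avoid. (In Python 2, keys() builds a new list on every call, so each membership test is a linear scan that gets slower as the dictionary grows.)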

Just write if word in counts.
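A minimal sketch of that change, assuming the same input format as in the question (label first, then space-separated tokens). The function name count_tokens and the decision to take an iterable of lines instead of a file name are illustrative assumptions, made so the example runs without a data file. It also uses split() with no argument, which differs slightly from the question's split(' ') in that runs of spaces do not produce empty tokens.

```python
def count_tokens(lines):
    """Return {token: [negative_count, positive_count]} for labelled lines."""
    counts = {}
    for line in lines:
        words = line.split()  # splits on any whitespace, so '\n' and '\r' are dropped
        if not words:
            continue  # skip blank lines
        # Label '0' means negative; any other label is treated as positive,
        # mirroring the branch structure of the question's code.
        index = 0 if words[0] == '0' else 1
        for word in words[1:]:
            if word in counts:  # hash lookup on the dict itself, no key list built
                counts[word][index] += 1
            else:
                entry = [0, 0]
                entry[index] = 1
                counts[word] = entry
    return counts


# Hypothetical two-line sample in the question's format.
sample = ["0 it snowing\n", "4 it sunny\n"]
print(count_tokens(sample))
```

With a real file, the same function could be called as count_tokens(open(infile)), since a file object iterates over its lines.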

Or use a defaultdict, which creates the entry for a missing key automatically: https://docs.python.org/2/library/collections.html#collections.defaultdict
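One way the defaultdict variant might look, as a sketch rather than a drop-in replacement. The name count_tokens_defaultdict and the iterable-of-lines parameter are assumptions for illustration; the label convention ('0' is negative, anything else positive) follows the question's code.

```python
from collections import defaultdict


def count_tokens_defaultdict(lines):
    """Return a defaultdict mapping token -> [negative_count, positive_count]."""
    # A missing token is initialised to [0, 0] on first access,
    # so no explicit membership test is needed.
    counts = defaultdict(lambda: [0, 0])
    for line in lines:
        words = line.split()  # splits on any whitespace, dropping line endings
        if not words:
            continue  # skip blank lines
        index = 0 if words[0] == '0' else 1
        for word in words[1:]:
            counts[word][index] += 1
    return counts


# Hypothetical two-line sample in the question's format.
counts = count_tokens_defaultdict(["0 so sad\n", "4 so happy\n"])
print(dict(counts))
```

One thing to keep in mind with this design: reading counts[token] for a token that was never seen inserts a new [0, 0] entry, so use token in counts or counts.get(token) when a lookup should not modify the dictionary.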


Source: https://habr.com/ru/post/975255/

