What data structure should I use here?

New programmer here. At the moment my program has a dictionary containing every year and the total number of words used in the literature for that year.

Now I need to find the relative frequency of a specific word, given by the user, for a given year. The relative frequency is the number of times that particular word was used, divided by the total number of words used during that year.

Do I need to make another dictionary mapping each year to the number of times the word was used in that year, or a completely different data structure? I should also mention that the user provides a start and end date.

Below is my function for the dictionary that I have. If you have suggestions on how to make this better, I’m all ears!

    import csv

    yearTotal = dict()

    def addTotal():
        with open('total_counts.csv') as allWords:
            readW = csv.reader(allWords, delimiter=',')
            for row in readW:
                # note: row[1] is read as a string; convert with int()
                # if you need to do arithmetic with it later
                yearTotal[row[0]] = row[1]

    addTotal()
1 answer

I assume that you have few years (maybe up to several hundred), so a list and a dictionary would have similar lookup times; however, a dictionary is semantically more convenient here.

At the same time, you probably have a lot of words for each year, so it's better to use a structure with constant-time (O(1)) lookups, such as a dictionary.

    from collections import defaultdict
    import csv

    yearTotal = defaultdict(lambda: defaultdict(int))

    with open('total_counts.csv') as fh:
        for year, word in csv.reader(fh, delimiter=","):
            yearTotal[year][''] += 1    # cache the total number of words for this year
            yearTotal[year][word] += 1

    # ...
    word = "foo"
    year = "1984"
    relative_frequency = float(yearTotal[year][word]) / yearTotal[year]['']
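Since the question also mentions that the user provides a start and end date, the same nested-dict layout extends naturally to a range query. Below is a minimal sketch; the helper name `relative_frequencies` and the tiny inline dataset are my own illustration, not part of the original post, and it assumes years are stored as strings (as read from the CSV) with the per-year total cached under the `''` key as above.

```python
from collections import defaultdict

# Hypothetical helper (not from the original post): compute the relative
# frequency of `word` for every year in [start, end] inclusive, using a
# nested dict where yearTotal[year][''] holds that year's total word count.
def relative_frequencies(yearTotal, word, start, end):
    result = {}
    for year in range(start, end + 1):
        key = str(year)               # years are stored as strings from the CSV
        total = yearTotal[key]['']
        if total:                     # skip years with no data
            result[key] = yearTotal[key][word] / total
    return result

# Tiny fabricated dataset for illustration
yearTotal = defaultdict(lambda: defaultdict(int))
for year, word in [("1984", "foo"), ("1984", "bar"),
                   ("1985", "foo"), ("1985", "foo")]:
    yearTotal[year][''] += 1
    yearTotal[year][word] += 1

print(relative_frequencies(yearTotal, "foo", 1984, 1985))
# {'1984': 0.5, '1985': 1.0}
```

Looping over the integer year range (rather than over the dictionary's keys) keeps the query bounded by the user's dates even when the data spans many more years.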

Source: https://habr.com/ru/post/1260611/

