If you are interested in only the most popular words N, and you do not need it, then there is a very smart structure that you can use. I heard about it through Udi Manber, it works as follows:
You create an array of N elements, each element keeps track of the value and counter, you also save a counter that is indexed into this array. In addition, you have a map from the value to be indexed into this array. Each time you update your structure with a value (for example, words from a text stream), you first check your map to see if that value is already in your array if you increment the counter for that value. If this is not the case, you decrease the number of items that your counter points to, and then increase the counter.
It sounds simple, and nothing about the algorithm seems like it will bring anything useful, but for typical real data, it tends to do very well. Usually, if you want to keep track of the top N things that you might need to create this structure with a capacity of 10 * N, since it will have many empty values. Using the King James Bible as an input, here is what this structure represents as the most common words (in a specific order):
0 : in 1 : And 2 : shall 3 : of 4 : that 5 : to 6 : he 7 : and 8 : the 9 : I
And here are the ten most common words (in order):
0 : the , 62600 1 : and , 37820 2 : of , 34513 3 : to , 13497 4 : And , 12703 5 : in , 12216 6 : that , 11699 7 : he , 9447 8 : shall , 9335 9 : unto , 8912
You see that he got 9 of the top 10 words, and he did it using space for only 50 elements. Depending on your use case, saving on location here can be very beneficial. It is also very fast.
Here is the topN implementation I used written in Go:
type Event string type TopN struct { events []Event counts []int current int mapped map[Event]int } func makeTopN(N int) *TopN { return &TopN{ counts: make([]int, N), events: make([]Event, N), current: 0, mapped: make(map[Event]int, N), } } func (t *TopN) RegisterEvent(e Event) { if index, ok := t.mapped[e]; ok { t.counts[index]++ } else { if t.counts[t.current] == 0 { t.counts[t.current] = 1 t.events[t.current] = e t.mapped[e] = t.current } else { t.counts[t.current]-- if t.counts[t.current] == 0 { delete(t.mapped, t.events[t.current]) } } } t.current = (t.current + 1) % len(t.counts) }