Your algorithm is a good start, but it will not give correct results. The problem is that hash tables the way you describe them are a one-way street: once a word is added, it stays counted forever, so old occurrences never drop out of the totals.
You need an array of 1440 (24 * 60) word+count hash maps organized as you describe; these are your per-minute counts. You also need two additional hash maps: one for the rolling total over the last hour and one for the last day.
Define two operations on hash maps, add and subtract, with the semantics of merging the counts of identical words and deleting a word when its count drops to zero.
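As a minimal sketch of those two operations, assuming each hash map is a plain Python `dict` from word to count (the function names `add` and `subtract` and the in-place update are my choices here, not something your original code dictates):

```python
def add(total, delta):
    """Merge the counts in `delta` into `total`, in place."""
    for word, n in delta.items():
        total[word] = total.get(word, 0) + n


def subtract(total, delta):
    """Remove the counts in `delta` from `total`, in place.

    A word is deleted once its count drops to zero, so the totals
    never accumulate dead entries.
    """
    for word, n in delta.items():
        remaining = total.get(word, 0) - n
        if remaining > 0:
            total[word] = remaining
        else:
            total.pop(word, None)
```

Deleting zero-count words matters for memory: without it the hourly and daily maps would keep growing with every word ever seen.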
At the start of every minute you create a new hash map and update its counters from the feed. At the end of the minute, you put this hash map in the array slot for the current minute, add it to the hourly and daily totals, then subtract the hash map from one hour ago from the hourly total and the hash map from 24 hours ago from the daily total.
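The per-minute bookkeeping could look roughly like the sketch below. It assumes plain `dict` maps, a circular array indexed by minute-of-day, and that something external calls `end_minute()` once per minute; the class name `RollingCounts` and its methods are hypothetical, chosen only for illustration.

```python
MINUTES_PER_HOUR = 60
MINUTES_PER_DAY = 24 * 60  # 1440


def _add(total, delta):
    for word, n in delta.items():
        total[word] = total.get(word, 0) + n


def _subtract(total, delta):
    for word, n in delta.items():
        remaining = total.get(word, 0) - n
        if remaining > 0:
            total[word] = remaining
        else:
            total.pop(word, None)  # drop words whose count reaches zero


class RollingCounts:
    def __init__(self):
        # One word -> count map per minute of the day, used circularly.
        self.minutes = [{} for _ in range(MINUTES_PER_DAY)]
        self.hour = {}     # total over the last 60 completed minutes
        self.day = {}      # total over the last 1440 completed minutes
        self.current = {}  # the minute being accumulated right now
        self.index = 0     # array slot for the current minute

    def observe(self, word):
        """Count one occurrence of `word` from the feed."""
        self.current[word] = self.current.get(word, 0) + 1

    def end_minute(self):
        """Close the current minute and slide both windows forward."""
        i = self.index
        # The slot we are about to overwrite holds the minute from 24 hours ago.
        expired_day = self.minutes[i]
        expired_hour = self.minutes[(i - MINUTES_PER_HOUR) % MINUTES_PER_DAY]

        self.minutes[i] = self.current
        _add(self.hour, self.current)
        _add(self.day, self.current)
        _subtract(self.hour, expired_hour)
        _subtract(self.day, expired_day)

        self.current = {}
        self.index = (i + 1) % MINUTES_PER_DAY
```

Each `end_minute()` touches only the words in three small per-minute maps, so the cost per minute does not depend on how large the hourly or daily totals have become.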
Finally, you need a way to produce the top 100 words from a hash map. This should be a trivial task: copy the elements into an array of word+count entries, sort by count, and keep the top 100.
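For illustration, a small sketch of that last step, assuming the hash map is a Python `dict` from word to count (the function name `top_words` is hypothetical):

```python
def top_words(counts, n=100):
    """Return the `n` most frequent (word, count) entries, highest first."""
    entries = list(counts.items())
    # Sort by count, descending; words with equal counts keep their
    # original relative order.
    entries.sort(key=lambda entry: entry[1], reverse=True)
    return entries[:n]
```

For example, `top_words({"a": 3, "b": 5, "c": 1}, n=2)` returns `[("b", 5), ("a", 3)]`. If the daily map gets large, `heapq.nlargest` from the standard library would avoid sorting every entry, but a full sort is probably fine unless you run this very often.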