Real-time tracking of the top 100 words per minute / hour / day

I recently came across this interview question:

Given a continuous twitter feed, design an algorithm to return the 100 most frequent words used at this minute, this hour and this day. 

I was thinking of a system with three word -> count hash maps, each paired with a min-heap, for the current minute, hour, and day.

Each incoming message is tokenized and sanitized, and the word counts are updated in the hash maps (and the word's key is increased in the corresponding heap if the word is already there).

If a word is not in the heap (and the heap size == 100), check whether its frequency > the min value in the heap; if so, extract-min and insert the word into the heap.
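For concreteness, here is a minimal Python sketch of the scheme described above: one word -> count map plus a bounded min-heap keyed by count. The class and method names are illustrative, not from the question; re-heapifying after an in-place key bump is O(k), which is cheap for k = 100.

```python
import heapq
from collections import defaultdict

class Top100:
    """Sketch of the questioner's scheme: a word->count map plus a
    k-entry min-heap of [count, word] entries (k=100 in the question)."""

    def __init__(self, k=100):
        self.k = k
        self.counts = defaultdict(int)
        self.entry = {}   # word -> the [count, word] list living in the heap
        self.heap = []    # min-heap over those entries, smallest count on top

    def add(self, word):
        self.counts[word] += 1
        c = self.counts[word]
        if word in self.entry:
            self.entry[word][0] = c      # bump the key in place...
            heapq.heapify(self.heap)     # ...then restore heap order, O(k)
        elif len(self.heap) < self.k:
            e = [c, word]
            self.entry[word] = e
            heapq.heappush(self.heap, e)
        elif c > self.heap[0][0]:        # beats the current minimum
            _, evicted = heapq.heappop(self.heap)   # extract-min
            del self.entry[evicted]
            e = [c, word]
            self.entry[word] = e
            heapq.heappush(self.heap, e)

    def top(self):
        # current top-k as [count, word] pairs, highest count first
        return sorted(self.heap, reverse=True)
```

Usage: with `k=2`, feeding `a a a b b c` leaves `[[3, "a"], [2, "b"]]` in the heap. Note that, as the accepted answer explains, this structure alone only tracks all-time counts, not sliding windows.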

Are there any better ways to do this?

+6
2 answers

Your algorithm is a good start, but it will not give correct results. The problem is that a hash map as you describe it is a one-way street: once a word is added, it stays counted forever, so the counts cover the whole feed rather than just the last minute, hour, or day.

You need an array of 1440 (24 * 60) word+count hash maps, organized as you describe; these are your per-minute counts. You also need two additional hash maps - one for the hourly total and one for the daily total.

Define two operations on hash maps - add and subtract - with the semantics of merging the counts of identical words, and deleting words when their count drops to zero.
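With Python's `collections.Counter` as the hash map, the two operations might look like this (a sketch; `Counter.subtract` leaves zero-count keys in place, hence the explicit pruning):

```python
from collections import Counter

def add(total, delta):
    # merge the counts of identical words into the running total
    total.update(delta)

def subtract(total, delta):
    # remove delta's contribution, deleting words whose count drops to zero
    total.subtract(delta)
    for word in delta:
        if total[word] <= 0:
            del total[word]
```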

Each minute you start a new hash map and update its counts from the feed. At the end of the minute, you place that hash map into the array slot for the current minute, add it to the hourly and daily totals, and then subtract the hash map from one hour ago from the hourly total and the hash map from 24 hours ago from the daily total.

Finally, you need a way to produce the top 100 words from a hash map. This should be a trivial task - add the entries to an array of word+count pairs, sort by count, and keep the top 100.
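For example, with a plain dict or `Counter` as the hash map, that step is a one-liner:

```python
def top_100(counts):
    # word+count pairs, sorted by count descending, first 100 kept
    return sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:100]
```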

+6

dasblinkenlight made a good point about the mistake of never evicting items from your hash map.

There is one more thing to add: to actually compute the top K words for a given minute / hour / day, it is faster to use O(n) partitioning rather than O(n log n) sorting:

  • dump the hash map for the minute / hour / day into an array: O(n)
  • use median-of-medians selection to find the Kth element: O(n)
  • partition around the Kth element: O(n)
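The steps above can be sketched as a quickselect over the word+count array. For brevity this sketch uses random pivots, which give expected O(n); substituting median-of-medians pivot selection would make the bound worst-case O(n), as the answer suggests.

```python
import random

def top_k(counts, k):
    """Selection-based top-k over a word->count map.

    Returns k (word, count) pairs with the largest counts, in no
    particular order, in expected O(n) time.
    """
    items = list(counts.items())        # dump the hash map into an array: O(n)
    if k >= len(items):
        return items
    lo, hi = 0, len(items) - 1
    target = k - 1                      # index of the kth-largest after partitioning
    while True:
        p = random.randint(lo, hi)      # random pivot (median-of-medians here
        items[p], items[hi] = items[hi], items[p]   # would give worst-case O(n))
        pivot = items[hi][1]
        store = lo
        for i in range(lo, hi):         # partition: larger counts to the left
            if items[i][1] > pivot:
                items[store], items[i] = items[i], items[store]
                store += 1
        items[store], items[hi] = items[hi], items[store]
        if store == target:
            return items[:k]            # first k slots now hold the top k
        if store < target:
            lo = store + 1
        else:
            hi = store - 1
```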

HTH.

+1

Source: https://habr.com/ru/post/913415/
