Develop an algorithm to return the number of unique users between a given time interval

A web server can receive millions of user login requests. The user can log in several times. Development of an optimal algorithm / data structure in terms of time complexity to return the total number of unique users between given time intervals.

For example, for example: the total number of unique users between the intervals t1 and t2 and t2 and t3. Also consider returning the total score for overlapping intervals (t1 = 10am, t2 = 10:15 in the morning, t3 = 10:30 in the morning, returning the total number of users from 10:10 to 10:20)

Below is what I offer, will people reviews be appreciated?

An IMO combination of Hashmap and min heap would be good as an optimal solution.

Hashmap- will have a key as a user identifier and a value as a node pointer to the corresponding node in min. Min heap - time of the last login as a key and value as a user identifier. The root will be the user whose login time is the oldest. Also, the total number of nodes in the mini-heap is calculated in the root store so that we can quickly return the account.

  • When a user logs in. Search in hashmap with user id as key. a) If there is no match, insert the new user in hashmap and the new node for the user in the minimum heap and increase the counter stored with the root node of the minimum heap. b) otherwise the old user will update the value of the last entry and not increase the count in the root folder of the node of the minimum heap.

  • Whenever we want to know the unique users registered between t2-t1, then a. Extract min (root) from the heap and check if the current time is the last login time> t2-t1 min. If it is more than remove the value from hashmap and min_heap. b. Repeat the above step (a) until the minimum heap element satisfies the current time - the time of the last entry <= t2-t1 mins s. Return the counter value from the root of the minimum heap node.

But I can not smooth the interval overlap algorithm.

+4
source share
1 answer

I think there is a much simpler way to do this. Consider storing all the data in a balanced binary search tree, where the keys are the login time, and the values ​​are a list of all the people who logged in at that time (provided that there can be several logins at the same time time), From there you can find all the people who logged in during the time interval between T1 and T2, finding the smallest node in BST whose time is greater than or equal to T1, and then continuously calculates the successor of the node sequence until you come to node, which is on time with three after time T2.

Doing a search in BST to find the first node at times greater than or equal to T1 will take O (log n) time in a balanced BST, and computing an inorder successor will take O (k) times many times, where k is the total number of matches you report . This requires a total time of O (log n + k). Since you have to spend at least O (k) time reporting all the coincidence logic in any algorithm, this has very low overhead.

Alternatively, if you receive data in a stream from the server (i.e. new logins always occur as time develops), you can simply use the standard array to store all requests. You can simply add new queries to the end of the array. Since time always moves forward, this means the array is always sorted, so you can use a binary search to find the starting point of the range. Assuming the data is not pathologically constructed, you can also use the interpolation search so that the expected search time is O (log log n) rather than O (log n), giving the expected O (log log n + k) search times when searching k full elements.

As for the processing of overlapping ranges, there are standard algorithms for collecting a set of ranges and combining them into a minimum number of non-overlapping ranges. You can always apply one of these methods before performing a search to handle this case.

Hope this helps!

+1
source

Source: https://habr.com/ru/post/1487684/


All Articles