Short version: how do I do the equivalent of the following in Google App Engine (Python)?

    SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" AND t >= start_time AND t <= end_time
Long version:
I have a Python app on Google App Engine whose users generate events such as pageviews. I would like to find out how many unique users generated a pageview event during a given period of time. The period I am most interested in is one week, and in a typical week there are about a million such events. I want to compute this in a cron job.
My event objects are as follows:
    class Event(db.Model):
        t = db.DateTimeProperty(auto_now_add=True)
        user = db.StringProperty(required=True)
        event_type = db.StringProperty(required=True)
With a SQL database, I would do something like
    SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" AND t >= start_time AND t <= end_time
The obvious approach is to fetch all the PAGEVIEW events and filter out the duplicate users. Something like:
    query = Event.all()
    query.filter("event_type =", "PAGEVIEW")
    query.filter("t >=", start_time)
    query.filter("t <=", end_time)
    usernames = []
    for event in query:
        usernames.append(event.user)
    answer = len(set(usernames))
But this will not work, because a query only returns up to 1000 results. The next thing I thought of was to fetch 1000 events, and once those run out, fetch the next thousand, and so on. But that will not work either, because making a thousand fetches and retrieving a million entities would take well over 30 seconds, which is the request time limit.
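Roughly, the batched version I had in mind would be something like this (just a sketch using query cursors, which I believe the SDK supports; even so, it would still not finish within the 30-second limit for a million events):

    from google.appengine.ext import db

    def count_unique_pageview_users(start_time, end_time):
        # Sketch only: fetch 1000 events at a time, resuming with a query cursor.
        usernames = set()
        query = Event.all()
        query.filter("event_type =", "PAGEVIEW")
        query.filter("t >=", start_time)
        query.filter("t <=", end_time)
        batch = query.fetch(1000)
        while batch:
            usernames.update(event.user for event in batch)
            query.with_cursor(query.cursor())  # continue where the last batch ended
            batch = query.fetch(1000)
        return len(usernames)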
Then I thought that ordering by user would make it faster to skip duplicates. But that is not allowed, because I am already using the inequality filter t >= start_time AND t <= end_time.
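For reference, the combination the datastore rejects would look like this (my understanding is that the first sort order has to be on the same property as the inequality filter, so this query fails):

    query = Event.all()
    query.filter("event_type =", "PAGEVIEW")
    query.filter("t >=", start_time)
    query.filter("t <=", end_time)
    query.order("user")  # rejected: first sort order must be on "t", the inequality property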
It seems clear that this cannot be done within 30 seconds, so it needs to be broken into pieces. But counting distinct items does not split up naturally. The best I can come up with is that on each cron run I could fetch 1000 pageview events, extract the distinct usernames from them, and store them in an entity such as a Chard. It might look something like
    class Chard(db.Model):
        usernames = db.StringListProperty(required=True)
That way each Chard would hold up to 1000 usernames, fewer if duplicates were removed. After about 16 hours (which would be fine), I would have all the chards and could do something like:
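To make the idea concrete, here is a rough sketch of what each cron run might do; the Checkpoint entity, the handler name, and the START_TIME/END_TIME constants are my own placeholders, not something I have actually built or tested:

    from google.appengine.ext import db
    from google.appengine.ext import webapp

    class Checkpoint(db.Model):
        # Placeholder bookkeeping entity: remembers where the previous cron run stopped.
        cursor = db.TextProperty()

    class BuildChardHandler(webapp.RequestHandler):
        def get(self):
            checkpoint = Checkpoint.get_or_insert("pageview-uniques")
            query = Event.all()
            query.filter("event_type =", "PAGEVIEW")
            query.filter("t >=", START_TIME)  # assumed defined elsewhere
            query.filter("t <=", END_TIME)    # assumed defined elsewhere
            if checkpoint.cursor:
                query.with_cursor(checkpoint.cursor)
            events = query.fetch(1000)
            if events:
                usernames = list(set(event.user for event in events))
                Chard(usernames=usernames).put()
                checkpoint.cursor = query.cursor()
                checkpoint.put()

A cron.yaml entry would then just hit this handler every minute or so until the whole week had been processed.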
    chards = Chard.all()
    all_usernames = set()
    for chard in chards:
        all_usernames = all_usernames.union(chard.usernames)
    answer = len(all_usernames)
It seems like this might work, but it is hardly an elegant solution. And with enough unique users, that final loop could also take too long. I have not tested it yet, in the hope that someone comes up with a better suggestion, so we will see whether the loop turns out to be fast enough.
Is there a nicer solution to my problem?
Of course, unique user counts like this could easily be obtained from Google Analytics, but I am building an application-specific dashboard, and this count is intended to be the first of many features.