Getting DISTINCT Users on Google App Engine

How can I do this in Google App Engine (Python)?

SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" AND t >= start_time AND t <= end_time 

Long version:

I have a Python app for Google App Engine with users who generate events such as pageviews. I would like to find out how many unique users generated a pageview event within a given timespan. The timespan I am most interested in is one week, and there are about a million such events in a given week. I want to run this in a cron job.

My event objects are as follows:

    from google.appengine.ext import db

    class Event(db.Model):
        t = db.DateTimeProperty(auto_now_add=True)
        user = db.StringProperty(required=True)
        event_type = db.StringProperty(required=True)

With a SQL database, I would do something like

 SELECT COUNT(DISTINCT user) FROM event WHERE event_type = "PAGEVIEW" AND t >= start_time AND t <= end_time 

The first thing that occurs to me is to get all PAGEVIEW events and filter out duplicate users. Something like:

    query = Event.all()
    query.filter("event_type =", "PAGEVIEW")  # only count pageview events
    query.filter("t >=", start_time)
    query.filter("t <=", end_time)
    usernames = []
    for event in query:
        usernames.append(event.user)
    answer = len(set(usernames))

But this will not work, because it only supports up to 1000 events. The next thing that occurs to me is to get 1000 events, then, when those run out, get the next thousand, and so on. But that will not work either, because going through a thousand queries and retrieving a million entities would take more than 30 seconds, which is the request time limit.

Then I thought that ORDER BY user should make it faster to skip over duplicates. But that is not allowed, because I am already using the inequality "t >= start_time AND t <= end_time".
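To illustrate why sorted results would help: if the events did arrive ordered by user, duplicates would be adjacent and could be skipped without remembering every name seen so far. The following is only a sketch in plain Python. The function `count_distinct_sorted` and the sample list are hypothetical, and the sketch does not get around the datastore restriction described above.

```python
from itertools import groupby


def count_distinct_sorted(sorted_users):
    """Count distinct values in an iterable that is already sorted.

    Equal values are adjacent in sorted input, so each run of equal
    values is one distinct user and no set of seen names is needed.
    """
    return sum(1 for _key, _group in groupby(sorted_users))


# Hypothetical sample data standing in for results ordered by user.
users_in_order = ["alice", "alice", "bob", "carol", "carol", "carol"]
distinct_users = count_distinct_sorted(users_in_order)
```

This runs in a single pass with constant extra memory, which is the appeal of the idea, but it only gives the right answer when the input really is sorted.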

It seems clear that this cannot be done in under 30 seconds, so the work needs to be split up. But finding distinct items does not seem to split well into subtasks. The best I can think of is, on every cron run, to find 1000 pageview events, extract the distinct usernames from them, and put them in an entity such as Chard. It might look something like:

    class Chard(db.Model):
        usernames = db.StringListProperty(required=True)

Thus, each chard would hold up to 1000 usernames, fewer if duplicates were removed. After about 16 hours (which is fine), I would have all the chards and could do something like:

    chards = Chard.all()
    all_usernames = set()
    for chard in chards:
        all_usernames = all_usernames.union(chard.usernames)
    answer = len(all_usernames)

It seems like this might work, but it is hardly a beautiful solution. And with enough unique users, this loop might take too long. I have not tested it, in the hope that someone will come up with a better suggestion, so I do not know whether this loop would turn out to be fast enough.
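The batching idea can be tried out without the datastore. The sketch below is plain Python under stated assumptions: lists stand in for Chard entities, `build_chards` and `count_unique` are hypothetical helper names, the sample data is made up, and the schedule estimate assumes one cron run per minute, which the question does not state.

```python
def build_chards(usernames, batch_size=1000):
    """Split a stream of event usernames into de-duplicated batches.

    Each returned list plays the role of one Chard.usernames value; in
    the real app each batch would come from one cron run.
    """
    chards = []
    batch = []
    for name in usernames:
        batch.append(name)
        if len(batch) == batch_size:
            chards.append(sorted(set(batch)))
            batch = []
    if batch:  # last, partially filled batch
        chards.append(sorted(set(batch)))
    return chards


def count_unique(chards):
    """Union all batches and count the distinct usernames."""
    all_usernames = set()
    for chard in chards:
        all_usernames.update(chard)  # in place, no new set per batch
    return len(all_usernames)


# Hypothetical data: 10,000 events spread over 2,500 users.
events = ["user%d" % (i % 2500) for i in range(10000)]
chards = build_chards(events)
answer = count_unique(chards)

# Rough schedule: one batch per cron run, assuming one run per minute.
total_events = 1000000
runs_needed = total_events // 1000
hours_needed = runs_needed / 60.0
```

Under the one-run-per-minute assumption, a million events take 1000 runs, or a little under 17 hours, which matches the estimate of about 16 hours. The final union still has to hold every distinct username in memory at once, so its cost grows with the number of unique users rather than the number of events.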

Is there a nicer solution to my problem?

Of course, all of these unique user counts could easily be done using Google Analytics, but I am building a dashboard of application-specific metrics, and I intend this to be the first of many stats.

+4
4 answers

Here is a possible solution. It depends on the use of memcache, so there is always a chance that your data will be evicted in an unpredictable way. Caveat emptor.

You would have a memcache variable called unique_visits_today or something similar. Each time a user has their first pageview of the day, you would use the .incr() function to increment this counter.

Determining whether this is the user's first visit of the day is done using a last_activity_day field attached to the user. When a user visits, you look at this field, and if it is earlier than today, you update it to today and increment the memcache counter.

At midnight every day, a cron job would take the current value of the memcache counter and write it to the datastore, resetting the counter to zero at the same time. You would have a model like this:

    class UniqueVisitsRecord(db.Model):
        # be careful setting date correctly if processing at midnight
        activity_date = db.DateProperty()
        event_count = db.IntegerProperty()

You could then simply, easily, and quickly get all the UniqueVisitsRecord entities that match any date range and add up the numbers in their event_count fields.
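The flow described in this answer can be sketched in plain Python. Everything here is a stand-in for illustration: `FakeMemcache` is a hypothetical in-memory class, not the App Engine memcache client, the dictionaries replace the user field and the UniqueVisitsRecord entities, and the dates and usernames are made up.

```python
import datetime


class FakeMemcache(object):
    """Tiny in-memory stand-in for a memcache client (illustration only)."""

    def __init__(self):
        self._data = {}

    def incr(self, key, delta=1, initial_value=0):
        self._data[key] = self._data.get(key, initial_value) + delta
        return self._data[key]

    def get(self, key):
        return self._data.get(key)

    def set(self, key, value):
        self._data[key] = value


COUNTER_KEY = "unique_visits_today"


def record_visit(cache, last_activity_day, user, today):
    """Count the visit only if it is the user's first one today."""
    if last_activity_day.get(user) != today:
        last_activity_day[user] = today
        cache.incr(COUNTER_KEY)


def flush_daily_count(cache, records, day):
    """What the midnight cron job would do: persist the count, then reset."""
    records[day] = cache.get(COUNTER_KEY) or 0
    cache.set(COUNTER_KEY, 0)


cache = FakeMemcache()
last_activity_day = {}  # user -> date of last counted visit
records = {}            # date -> count (stands in for UniqueVisitsRecord)

day1 = datetime.date(2013, 1, 1)
day2 = datetime.date(2013, 1, 2)
for user in ["alice", "bob", "alice"]:
    record_visit(cache, last_activity_day, user, day1)
flush_daily_count(cache, records, day1)
for user in ["alice", "carol"]:
    record_visit(cache, last_activity_day, user, day2)
flush_daily_count(cache, records, day2)

total_over_range = sum(records.values())
```

Summing daily records counts a user once per active day, so in this sample the two-day total is 4 even though only three different users appear. Over a week, the sum is therefore an upper bound on the number of distinct users rather than an exact count.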

+1

Starting with SDK v1.7.4, there is now experimental support for the DISTINCT function.

See: https://developers.google.com/appengine/docs/python/datastore/gqlreference

+4

Google App Engine, and more specifically GQL, does not support the DISTINCT function.

But you can use the Python set function as described in this blog and this SO Question.

+1

NDB still does not support DISTINCT. I wrote a small utility method to be able to use DISTINCT with GAE.

See here. http://verysimplescripts.blogspot.jp/2013/01/getting-distinct-properties-with-ndb.html

+1

Source: https://habr.com/ru/post/1299745/

