How to make subtitles in BigQuery

Question

How to make subtitles in BigQuery

We have a huge event table in which users register and play our games.

Now I want to determine the retention of the second day for each day, which is the percentage of players registering the day before, as well as the next day.

So, suppose we have three fields

timestamp ts int userId int eventId (Ie 1 = Register, 2 = Login)

How is this done in BigQuery syntax? that is, I would like to get the following result:

 Date Register Logins day after % Second day retention 2013-08-23 25 563 4 567 17.8

I was unable to complete the subqueries and joins, but this should be doable!

+4

google-bigquery

Gunnar eketrapp Aug 27 '13 at 15:43

source share

1 answer

Felipe hoffa · Accepted Answer · 2013-08-27T18:02:59+0000

How about this public data request:

 SELECT a.day, first_day, return_next_day, integer((return_next_day / first_day) * 100) percent FROM ( SELECT COUNT(DISTINCT actor, 50000) first_day, STRFTIME_UTC_USEC( UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day, FROM [publicdata:samples.github_timeline] GROUP BY day) a JOIN ( SELECT COUNT(*) return_next_day, day FROM ( SELECT a.day day, a.actor, b.day, b.actor FROM ( SELECT STRFTIME_UTC_USEC( UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day, MAX(STRFTIME_UTC_USEC(86400000000 + UTC_USEC_TO_DAY( PARSE_UTC_USEC(created_at)), "%Y-%m-%d")) dayplus, actor FROM [publicdata:samples.github_timeline] GROUP EACH BY actor, day) a JOIN EACH ( SELECT STRFTIME_UTC_USEC( UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day, actor FROM [publicdata:samples.github_timeline] GROUP EACH BY actor, day) b ON a.actor = b.actor AND a.dayplus = b.day ) GROUP BY day) b ON a.day = b.day

This gives me the desired results:

Note that the query uses STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day many times to convert the source string data to date. If I owned the data, I would run the ETL at the table in advance to skip this recurring step.

The query combines 2 tables:

The first table calculates how many different "actors" are present on a particular date. Note the second parameter in COUNT DISTINCT to make the score accurate.
The second table joins this day from the next day, given that the same actor is present on both days. You can then calculate how many actors are present on that day and the next day.
Combining both tables gives you both values, and you can move on to partitioning.

There are alternative methods; this is just one of many approaches. You can also optimize this query even more.

How to make subtitles in BigQuery

More articles: