How about this public data request:
SELECT a.day, first_day, return_next_day, integer((return_next_day / first_day) * 100) percent FROM ( SELECT COUNT(DISTINCT actor, 50000) first_day, STRFTIME_UTC_USEC( UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day, FROM [publicdata:samples.github_timeline] GROUP BY day) a JOIN ( SELECT COUNT(*) return_next_day, day FROM ( SELECT a.day day, a.actor, b.day, b.actor FROM ( SELECT STRFTIME_UTC_USEC( UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day, MAX(STRFTIME_UTC_USEC(86400000000 + UTC_USEC_TO_DAY( PARSE_UTC_USEC(created_at)), "%Y-%m-%d")) dayplus, actor FROM [publicdata:samples.github_timeline] GROUP EACH BY actor, day) a JOIN EACH ( SELECT STRFTIME_UTC_USEC( UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day, actor FROM [publicdata:samples.github_timeline] GROUP EACH BY actor, day) b ON a.actor = b.actor AND a.dayplus = b.day ) GROUP BY day) b ON a.day = b.day
This gives me the desired results:

Note that the query uses STRFTIME_UTC_USEC(UTC_USEC_TO_DAY(PARSE_UTC_USEC(created_at)), "%Y-%m-%d") day many times to convert the source string data to date. If I owned the data, I would run the ETL at the table in advance to skip this recurring step.
The query combines 2 tables:
The first table calculates how many different "actors" are present on a particular date. Note the second parameter in COUNT DISTINCT to make the score accurate.
The second table joins this day from the next day, given that the same actor is present on both days. You can then calculate how many actors are present on that day and the next day.
Combining both tables gives you both values, and you can move on to partitioning.
There are alternative methods; this is just one of many approaches. You can also optimize this query even more.
source share