First, the answer using standard SQL, given your hypothesis: there is an EVENT table with a simple layout:
EVENTS ----------------------------- SESION_ID , EVENT_NAME , TMST
To get a session that has completed step # 1 for some time:
-- QUERY 1 SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID;
Here I make the assumption that event1 may occur more than once per session. The result is a list of a unique session that has demonstrated event1 for some time.
To get step2 and step3, I can just do the same:
-- QUERY 2 SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID; -- QUERY 3 SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event3' GROUP BY SESSION_ID;
Now you want to select the sessions that performed step1, step2 and step3 - in that order. More precisely, you need to count the sessions that performed step 1, and then count the session that performed step2, and then count the sessions that performed step3. Basically, we just need to combine the 3 above queries with the left join to display the sessions that went into the sequence and what steps they performed:
-- FUNNEL FOR S1/S2/S3 SELECT SESSION_ID, Q1.TMST IS NOT NULL AS PERFORMED_STEP1, Q2.TMST IS NOT NULL AS PERFORMED_STEP2, Q3.TMST IS NOT NULL AS PERFORMED_STEP3 FROM -- QUERY 1 (SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event1' GROUP BY SESSION_ID) AS Q1, LEFT JOIN -- QUERY 2 (SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q2, LEFT JOIN -- QUERY 3 (SELECT SESSION_ID,MIN(TMST) FROM EVENTS WHERE EVENT_NAME='event2' GROUP BY SESSION_ID) AS Q3 -- Q2 & Q3 ON Q2.SESSION_ID=Q3.SESSION_ID AND Q2.TMST<Q3.TMST -- Q1 & Q2 ON Q1.SESSION_ID=Q2.SESSION_ID AND Q1.TMST<Q2.TMST
The result is a list of a unique session that introduced a funnel in step 1 and possibly continued to perform steps2 and step3 ... for example:
SESSION_ID_1,TRUE,TRUE,TRUE SESSION_ID_2,TRUE,TRUE,FALSE SESSION_ID_3,TRUE,FALSE,FALSE ...
Now we just need to calculate some statistics, for example:
SELECT STEP1_COUNT, STEP1_COUNT-STEP2_COUNT AS EXIT_AFTER_STEP1, STEP2_COUNT*100.0/STEP1_COUNT AS PERCENTAGE_TO_STEP2, STEP2_COUNT-STEP3_COUNT AS EXIT_AFTER_STEP2, STEP3_COUNT*100.0/STEP2_COUNT AS PERCENTAGE_TO_STEP3, STEP3_COUNT*100.0/STEP1_COUNT AS COMPLETION_RATE FROM (-- QUERY TO COUNT session at each step SELECT SUM(CASE WHEN PERFORMED_STEP1 THEN 1 ELSE 0 END) AS STEP1_COUNT, SUM(CASE WHEN PERFORMED_STEP2 THEN 1 ELSE 0 END) AS STEP2_COUNT, SUM(CASE WHEN PERFORMED_STEP3 THEN 1 ELSE 0 END) AS STEP3_COUNT FROM [... insert the funnel query here ...] ) AS COMPUTE_STEPS
Et voilà!
Now for the discussion. The first point, the result is quite simple, given that you are taking a “predetermined” (or functional) way of thinking, rather than a “procedural” approach. Do not visualize the database as a collection of fixed tables with columns and rows ... this is how it is implemented, but you do not interact with it. All of this installs, and you can arrange the sets as you need!
The second point is that the query will be automatically optimized for parallel operation if you use, for example, the MPP database. You don’t even have to program the query in different ways, use map reduction or something else ... I performed the same query in my test dataset with more than 100 millionth events and get results in seconds.
And last but not least, the query opens endless possibilities. Just group by the results of the abstract, keywords, landing page, user information and analysis that provides the best conversion rate, for example!