Sequence analysis: how would you calculate a funnel?

Suppose I track "events" that users trigger on the website; the events might be:

  • pages viewed
  • added item to cart
  • Photo
  • paid for order

Now each of these events is stored in a database, for example:

 session_id, event_name, created_date, ..

So now I want to create a report to display a specific sequence, which I will define as:

 Step #1: event_n
 Step #2: event_n2
 Step #3: event_n3

Thus, this particular funnel has 3 steps, and each step can be associated with ANY event I choose.

How can I create a report for this, given the above data?

Note: just to be clear, I want to be able to define any funnel myself and then generate a report for it.

The easiest way I can think of:

  • get all events for every step from the database
  • step #1 becomes: x% of people completed event_n
  • for step #2, take the sessions that completed event_n2 AND also completed step #1, and display the %
  • same for step #3, but conditioned on step #2
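The set logic behind these bullet points can be sketched in a few lines of Python (the column names and the sample data are made up for illustration; note this simple version ignores event order within a session, unlike the timestamp-based SQL in the answers):

```python
# Hypothetical event log as (session_id, event_name, timestamp) rows.
events = [
    ("s1", "event_n", 1), ("s1", "event_n2", 2), ("s1", "event_n3", 3),
    ("s2", "event_n", 1), ("s2", "event_n2", 2),
    ("s3", "event_n", 1),
]

def funnel_counts(events, steps):
    """Count sessions reaching each step, requiring all earlier steps too."""
    reached = None            # sessions still in the funnel
    counts = []
    for step in steps:
        sessions = {sid for sid, name, _ in events if name == step}
        reached = sessions if reached is None else reached & sessions
        counts.append(len(reached))
    return counts

print(funnel_counts(events, ["event_n", "event_n2", "event_n3"]))  # [3, 2, 1]
```

Each count divided by the previous one gives the step-to-step percentage the question asks for.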

I'm curious how these online services can display these kinds of reports in a hosted SaaS environment. Can MapReduce make this easier?

3 answers

First, an answer using standard SQL, given your hypothesis: there is an EVENTS table with a simple layout:

 EVENTS
 -----------------------------
 SESSION_ID, EVENT_NAME, TMST

To get the sessions that performed step #1 at some point in time:

 -- QUERY 1
 SELECT SESSION_ID, MIN(TMST) AS TMST
 FROM EVENTS
 WHERE EVENT_NAME = 'event1'
 GROUP BY SESSION_ID;

Here I assume that event1 can occur more than once per session. The result is the list of unique sessions that exhibited event1 at some point in time.

To get step 2 and step 3, I can just do the same:

 -- QUERY 2
 SELECT SESSION_ID, MIN(TMST) AS TMST
 FROM EVENTS
 WHERE EVENT_NAME = 'event2'
 GROUP BY SESSION_ID;

 -- QUERY 3
 SELECT SESSION_ID, MIN(TMST) AS TMST
 FROM EVENTS
 WHERE EVENT_NAME = 'event3'
 GROUP BY SESSION_ID;

Now you want to select the sessions that performed step 1, step 2 and step 3, in that order. More precisely, you want to count the sessions that performed step 1, then the sessions that went on to perform step 2, then the sessions that went on to perform step 3. Basically, we just combine the three queries above with LEFT JOINs to display the sessions that entered the funnel and which steps they performed:

 -- FUNNEL FOR S1/S2/S3
 SELECT
   Q1.SESSION_ID,
   Q1.TMST IS NOT NULL AS PERFORMED_STEP1,
   Q2.TMST IS NOT NULL AS PERFORMED_STEP2,
   Q3.TMST IS NOT NULL AS PERFORMED_STEP3
 FROM
   -- QUERY 1
   (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
    WHERE EVENT_NAME = 'event1' GROUP BY SESSION_ID) AS Q1
 LEFT JOIN
   -- QUERY 2
   (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
    WHERE EVENT_NAME = 'event2' GROUP BY SESSION_ID) AS Q2
   ON Q1.SESSION_ID = Q2.SESSION_ID AND Q1.TMST < Q2.TMST
 LEFT JOIN
   -- QUERY 3
   (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
    WHERE EVENT_NAME = 'event3' GROUP BY SESSION_ID) AS Q3
   ON Q2.SESSION_ID = Q3.SESSION_ID AND Q2.TMST < Q3.TMST;

The result is the list of unique sessions that entered the funnel at step 1 and may have continued on to steps 2 and 3... for example:

 SESSION_ID_1, TRUE,  TRUE,  TRUE
 SESSION_ID_2, TRUE,  TRUE,  FALSE
 SESSION_ID_3, TRUE,  FALSE, FALSE
 ...

Now we just need to calculate some statistics, for example:

 SELECT
   STEP1_COUNT,
   STEP1_COUNT - STEP2_COUNT         AS EXIT_AFTER_STEP1,
   STEP2_COUNT * 100.0 / STEP1_COUNT AS PERCENTAGE_TO_STEP2,
   STEP2_COUNT - STEP3_COUNT         AS EXIT_AFTER_STEP2,
   STEP3_COUNT * 100.0 / STEP2_COUNT AS PERCENTAGE_TO_STEP3,
   STEP3_COUNT * 100.0 / STEP1_COUNT AS COMPLETION_RATE
 FROM ( -- QUERY TO COUNT sessions at each step
   SELECT
     SUM(CASE WHEN PERFORMED_STEP1 THEN 1 ELSE 0 END) AS STEP1_COUNT,
     SUM(CASE WHEN PERFORMED_STEP2 THEN 1 ELSE 0 END) AS STEP2_COUNT,
     SUM(CASE WHEN PERFORMED_STEP3 THEN 1 ELSE 0 END) AS STEP3_COUNT
   FROM [... insert the funnel query here ...]
 ) AS COMPUTE_STEPS;

Et voilà!
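For readers who want to verify the funnel join, here is the same query run end-to-end on an in-memory SQLite database (the three-session dataset is invented for illustration):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE EVENTS (SESSION_ID TEXT, EVENT_NAME TEXT, TMST INTEGER);
INSERT INTO EVENTS VALUES
  ('A','event1',1),('A','event2',2),('A','event3',3),
  ('B','event1',1),('B','event2',2),
  ('C','event1',5),('C','event2',1);
""")
# Session A completes the funnel, B stops after step 2,
# and C performed event2 BEFORE event1, so its step 2 must not count.
rows = conn.execute("""
SELECT Q1.SESSION_ID,
       Q2.TMST IS NOT NULL AS PERFORMED_STEP2,
       Q3.TMST IS NOT NULL AS PERFORMED_STEP3
FROM (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
      WHERE EVENT_NAME = 'event1' GROUP BY SESSION_ID) AS Q1
LEFT JOIN (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
      WHERE EVENT_NAME = 'event2' GROUP BY SESSION_ID) AS Q2
  ON Q1.SESSION_ID = Q2.SESSION_ID AND Q1.TMST < Q2.TMST
LEFT JOIN (SELECT SESSION_ID, MIN(TMST) AS TMST FROM EVENTS
      WHERE EVENT_NAME = 'event3' GROUP BY SESSION_ID) AS Q3
  ON Q2.SESSION_ID = Q3.SESSION_ID AND Q2.TMST < Q3.TMST
ORDER BY Q1.SESSION_ID
""").fetchall()
print(rows)  # [('A', 1, 1), ('B', 1, 0), ('C', 0, 0)]
```

The `Q1.TMST < Q2.TMST` join conditions are what enforce the "in that order" requirement.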

Now for the discussion. First point: the result is quite simple, provided you adopt a declarative (or functional) way of thinking rather than a procedural one. Don't visualize the database as a collection of fixed tables with columns and rows... that is how it is implemented, but it is not how you have to interact with it. It is all sets, and you can arrange the sets any way you need!

Second point: the query gets optimized to run in parallel automatically if you use, for example, an MPP database. You don't even have to program the query differently, use MapReduce, or anything else... I ran the same query on a test dataset of more than 100 million events and got results in seconds.

And last but not least, the query opens up endless possibilities. Just group the results by referrer, keywords, landing page or user information, and analyze which one provides the best conversion rate, for example!


The main problem with how you are thinking about this is that you are thinking in a SQL/table model. Each event is one record. One of the nice things about NoSQL technologies (which you hint at) is that you can store a record as one session per record. Once you store the data session-wise, you can write a procedure that checks whether a session matches the pattern or not. There is no need for joins or anything else, just a loop over the list of transactions in the session. That is the power of semi-structured data.

What if you store your sessions together? Then all you have to do is iterate over each session and see if it matches.
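Assuming a session is stored as an ordered list of event names, the "does it match" procedure is just an ordered-subsequence test; a rough sketch in Python (event names are made up):

```python
def matches_funnel(session_events, pattern):
    """True if `pattern` occurs as an ordered subsequence of the session."""
    it = iter(session_events)
    # `step in it` advances the iterator, so order is enforced for free.
    return all(step in it for step in pattern)

# Made-up per-session records, one list of events per session.
sessions = {
    "s1": ["page_view", "add_to_cart", "page_view", "paid"],
    "s2": ["page_view", "paid"],   # skipped the cart: no match
}
pattern = ["page_view", "add_to_cart", "paid"]
hits = sum(matches_funnel(ev, pattern) for ev in sessions.values())
print(hits, "of", len(sessions), "sessions match")
```

No joins anywhere: each session is checked independently, which is exactly what makes this embarrassingly parallel.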

This is a fantastic use case for HBase, in my opinion.

With HBase, you store the session ID as the row key and then each event as a timestamped value in its own column qualifier. What you end up with is data grouped by session ID and then sorted by time.

So now you want to find out what % of sessions performed behavior 1, then 2, then 3. You run a MapReduce job over this data, which hands you one row-key/value pair per row. Write a loop over the data that checks whether it matches the pattern. If it does, count +1; if not, don't.


Without HBase, you can use MapReduce to sessionize your unorganized data at rest. Group by session ID; in the reducer you then have all the events belonging to that session grouped together. Now you are basically where you were with HBase: write a method in the reducer that checks for the pattern.
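The sessionize-then-check flow can be imitated locally with a sort plus itertools.groupby: the sort stands in for the MapReduce shuffle and the per-group loop for the reducer (illustrative only, with invented event names; a real job would run on Hadoop):

```python
from itertools import groupby
from operator import itemgetter

# Flat, unordered event log: (session_id, timestamp, event_name).
log = [
    ("s2", 2, "paid"), ("s1", 1, "view"), ("s1", 3, "paid"),
    ("s1", 2, "cart"), ("s2", 1, "view"),
]
funnel = ["view", "cart", "paid"]

# "Shuffle": sort by session key, then by time.
log.sort(key=itemgetter(0, 1))

# "Reduce": each group now holds one whole session, in time order.
matched = 0
for sid, group in groupby(log, key=itemgetter(0)):
    names = iter(name for _, _, name in group)
    if all(step in names for step in funnel):  # ordered-subsequence check
        matched += 1
print(matched)  # 1
```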


HBase may be overkill if you don't have a ridiculous amount of data. Any database that can store data hierarchically would be a good fit here. MongoDB, Cassandra and Redis all come to mind, each with its own strengths and weaknesses.


I recently released an open-source Hive UDF for this: hive-funnel-udf

It is quite easy to use for this kind of sequence-analysis task: you can just write Hive queries, with no need to write custom Java MapReduce code.

This will only work if you use Hive / Hadoop to store and query your data.


Source: https://habr.com/ru/post/915575/

