Let's say I have a list of pageview events, each of which has a session identifier. For each event, I want to add the time and URL of the chronologically first pageview in that event's session. For example, suppose my events are in a table test that looks like this:
uid | session_id | timestamp | url
----------------------------------------
u1  | 0          | 0         | a.com/
u1  | 1          | 1         | a.com/p1
u1  | 1          | 2         | a.com/p2
I need a SQL command that creates the following:
uid | session_id | timestamp | url      | s_timestamp | s_url
----------------------------------------------------------------
u1  | 0          | 0         | a.com/   | 0           | a.com/
u1  | 1          | 1         | a.com/p1 | 1           | a.com/p1
u1  | 1          | 2         | a.com/p2 | 1           | a.com/p1
Window functions seem to be the right tool here, but I'm new to them. The following statement produces the desired table, but I wonder whether it is suboptimal:
SELECT
  uid,
  session_id,
  timestamp,
  url,
  first_value(url) OVER (PARTITION BY uid, session_id ORDER BY timestamp ASC) AS s_url,
  first_value(timestamp) OVER (PARTITION BY uid, session_id ORDER BY timestamp ASC) AS s_timestamp
FROM test
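In particular, it seems redundant to repeat the OVER clause. Is there a way to get both the URL and the timestamp with a single OVER? I'm using Spark SQL, but ideally the solution would work in general SQL.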
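One direction that might remove the repeated window definition is a named window. This is only a sketch: it assumes the engine supports the standard WINDOW clause (Spark SQL appears to), and the window name w is an arbitrary, hypothetical choice.

```sql
-- Sketch only: the window is defined once and reused by name.
-- "w" is a hypothetical window name; "test" is the events table from the question.
SELECT
  uid,
  session_id,
  timestamp,
  url,
  first_value(timestamp) OVER w AS s_timestamp,
  first_value(url)       OVER w AS s_url
FROM test
WINDOW w AS (PARTITION BY uid, session_id ORDER BY timestamp ASC)
```

This still calls first_value twice, so it only removes the textual repetition. Whether it changes the execution plan is something I have not verified.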