Collapse multiple lines with adjacent timestamps

I have a table representing the use of a product, like a log. Product usage is recorded as multiple timestamps, I want to present the same data using time ranges.

It looks like this (PostgreSQL 9.1):

userid | timestamp | product ------------------------------------- 001 | 2012-04-23 9:12:05 | foo 001 | 2012-04-23 9:12:07 | foo 001 | 2012-04-23 9:12:09 | foo 001 | 2012-04-23 9:12:11 | barbaz 001 | 2012-04-23 9:12:13 | barbaz 001 | 2012-04-23 9:15:00 | barbaz 001 | 2012-04-23 9:15:01 | barbaz 002 | 2012-04-24 3:41:01 | foo 002 | 2012-04-24 3:41:03 | foo 

I want to collapse lines whose time difference with the previous run is less than the delta (say: 2 seconds), and get the start time and end time, for example:

 userid | begin | end | product ---------------------------------------------------------- 001 | 2012-04-23 9:12:05 | 2012-04-23 9:12:09 | foo 001 | 2012-04-23 9:12:11 | 2012-04-23 9:12:13 | barbaz 001 | 2012-04-23 9:15:00 | 2012-04-23 9:15:01 | barbaz 002 | 2012-04-24 3:41:01 | 2012-04-24 3:41:03 | foo 

Note that sequential use of the same product is split into two lines if their use is greater than delta (in this example, 2 seconds).

 create table t (userid int, timestamp timestamp, product text); insert into t (userid, timestamp, product) values (001, '2012-04-23 9:12:05', 'foo'), (001, '2012-04-23 9:12:07', 'foo'), (001, '2012-04-23 9:12:09', 'foo'), (001, '2012-04-23 9:12:11', 'barbaz'), (001, '2012-04-23 9:12:13', 'barbaz'), (001, '2012-04-23 9:15:00', 'barbaz'), (001, '2012-04-23 9:15:01', 'barbaz'), (002, '2012-04-24 3:41:01', 'foo'), (002, '2012-04-24 3:41:03', 'foo') ; 
+6
source share
1 answer

Inspired by this answer , a while back at @a_horse_with_no_name.

 WITH groupped_t AS ( SELECT *, sum(grp_id) OVER (ORDER BY userid,product,"timestamp") AS grp_nr FROM (SELECT t.*, lag("timestamp") OVER (PARTITION BY userid,product ORDER BY "timestamp") AS prev_ts, CASE WHEN ("timestamp" - lag("timestamp") OVER (PARTITION BY userid,product ORDER BY "timestamp")) <= '2s'::interval THEN NULL ELSE 1 END AS grp_id FROM t) AS g ), periods AS ( SELECT min(gt."timestamp") AS grp_min, max(gt."timestamp") AS grp_max, grp_nr FROM groupped_t AS gt GROUP BY gt.grp_nr ) SELECT gt.userid, p.grp_min AS "begin", p.grp_max AS "end", gt.product FROM periods p JOIN groupped_t gt ON gt.grp_nr = p.grp_nr AND gt."timestamp" = p.grp_min ORDER BY gt.userid, p.grp_min; 
  • The innermost query assigns grouping identifiers based on userid , product and time differences. I suggested that this should be safe for the PARTITION BY first two fields.
  • groupped_t gives me all the source columns + an extra workgroup number. I used only ORDER BY for the sum() window function, since I need the group identifiers to be unique.
  • periods is just an auxiliary query for the first and last timestamp in each group.
  • Finally, I join groupped_t with periods in grp_nr (so I need it to be unique) and the label of the first record in each group.

You can also check this query for SQL Fiddle .

Note that timestamp , begin and end reserved words in SQL ( end also for PostgreSQL), so you should either avoid or duplicate them.

+6
source

Source: https://habr.com/ru/post/918925/


All Articles