Find rare events in the current window by timestamp

Given the following table:

 CREATE TABLE gm_inductionloopdata (
   "id" serial NOT NULL,
   "timestamp" timestamp without time zone NOT NULL,
   "count" integer NOT NULL DEFAULT 0
 );

I am looking for "rare events". A rare event is a row that has the following properties:

  • Simple: count = 1
  • Hard: all other rows within 10 minutes before or after this row's timestamp have count = 0.

Example:

 id  timestamp  count
  0  08:00      0
  1  08:11      0
  2  08:15      2      <== not a rare event (count != 1)
  3  08:19      0
  4  08:24      0
  5  08:25      0
  6  08:29      1      <== not a rare event (see 08:35)
  7  08:31      0
  8  08:35      1
  9  08:40      0
 10  08:46      1      <== rare event!
 10  08:48      0
 10  08:51      0
 10  08:55      0
 10  08:58      1      <== rare event!
 10  09:02      0
 10  09:09      1
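The definition above can be written down as a brute-force reference, which is handy for cross-checking whatever a faster query returns. This is only a sketch in Python, not Postgres code: the names `ROWS`, `WINDOW`, `parse` and `rare_events_bruteforce` are made up for illustration, the ids are left out, and the window bounds are assumed to be inclusive (as with BETWEEN).

```python
from datetime import datetime, timedelta

# Sample rows from the example table as (time, count); ids are omitted.
ROWS = [
    ("08:00", 0), ("08:11", 0), ("08:15", 2), ("08:19", 0), ("08:24", 0),
    ("08:25", 0), ("08:29", 1), ("08:31", 0), ("08:35", 1), ("08:40", 0),
    ("08:46", 1), ("08:48", 0), ("08:51", 0), ("08:55", 0), ("08:58", 1),
    ("09:02", 0), ("09:09", 1),
]

WINDOW = timedelta(minutes=10)  # assumed inclusive on both sides


def parse(hhmm):
    """Turn 'HH:MM' into a datetime on an arbitrary fixed day."""
    return datetime.strptime(hhmm, "%H:%M")


def rare_events_bruteforce(rows):
    """Return times of rows with count == 1 whose +/-10 minute window
    contains no other row with a non-zero count (O(n^2) comparison)."""
    parsed = [(parse(t), c) for t, c in rows]
    rare = []
    for i, (ts, count) in enumerate(parsed):
        if count != 1:
            continue
        quiet = all(
            other_count == 0
            for j, (other_ts, other_count) in enumerate(parsed)
            if j != i and abs(other_ts - ts) <= WINDOW
        )
        if quiet:
            rare.append(ts.strftime("%H:%M"))
    return rare


# 09:09 also qualifies here because the sample ends right after it.
print(rare_events_bruteforce(ROWS))  # ['08:46', '08:58', '09:09']
```

Like the correlated subquery in the question, this compares every candidate with every other row, so it is only meant for small samples.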

Now I have the following PL/pgSQL function:

 SELECT curr.*
 FROM gm_inductionloopdata curr
 WHERE curr.count = 1
   AND (
     SELECT SUM(count)
     FROM gm_inductionloopdata
     -- BETWEEN expects the lower bound first
     WHERE timestamp BETWEEN curr.timestamp - '10 minutes'::INTERVAL
                         AND curr.timestamp + '10 minutes'::INTERVAL
   ) < 2

which is dead slow. :-(

Any suggestions on how to improve performance? I am working with more than 1 million rows here and may have to find these "rare events" regularly.

2 answers

This might be faster (improving on @Roman's first solution):

 SELECT id, ts, ct
 FROM (
   SELECT id, ts, ct
        , lag (ts, 1, '-infinity') OVER (ORDER BY ts) AS prev_ts
        , lead(ts, 1, 'infinity')  OVER (ORDER BY ts) AS next_ts
   FROM tbl
   WHERE ct <> 0
 ) sub
 WHERE ct = 1
   AND prev_ts < ts - interval '10 min'
   AND next_ts > ts + interval '10 min'
 ORDER BY ts;

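  • Handling the corner cases "no leading / lagging row" can be greatly simplified by using two pieces of information: lag() and lead() accept a default value as their third parameter, and the Postgres timestamp type supports the special values -infinity and infinity.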

  • Subqueries are generally more efficient than CTEs (some exceptions apply) because CTEs act as optimization barriers by design (since Postgres 12, simple CTEs can be inlined). If performance matters, use a CTE only when you need one.

Also:

  • I use proper column names (ts and ct) instead of timestamp and count, which removes the need for double quotes. Never use reserved words, base type names, or function names as identifiers.

  • None of this has anything to do with PL/pgSQL, which is the default procedural language of Postgres.
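To make the lag/lead logic easier to follow, here is a rough Python sketch of the same idea: keep only the non-zero rows, sort them by time, and compare each row with its immediate neighbours only. `datetime.min` and `datetime.max` stand in for the '-infinity' and 'infinity' defaults. The names `ROWS`, `WINDOW` and `rare_events_neighbours` are hypothetical, and this is not how Postgres executes the query.

```python
from datetime import datetime, timedelta

# Sample rows from the question as (time, count); ids are omitted.
ROWS = [
    ("08:00", 0), ("08:11", 0), ("08:15", 2), ("08:19", 0), ("08:24", 0),
    ("08:25", 0), ("08:29", 1), ("08:31", 0), ("08:35", 1), ("08:40", 0),
    ("08:46", 1), ("08:48", 0), ("08:51", 0), ("08:55", 0), ("08:58", 1),
    ("09:02", 0), ("09:09", 1),
]

WINDOW = timedelta(minutes=10)


def rare_events_neighbours(rows):
    """Single pass over the non-zero rows, looking one row back and one ahead."""
    # Equivalent of: WHERE ct <> 0 ... ORDER BY ts
    nonzero = sorted(
        (datetime.strptime(t, "%H:%M"), c) for t, c in rows if c != 0
    )
    rare = []
    for i, (ts, count) in enumerate(nonzero):
        # Defaults play the role of lag(..., '-infinity') / lead(..., 'infinity').
        prev_ts = nonzero[i - 1][0] if i > 0 else datetime.min
        next_ts = nonzero[i + 1][0] if i + 1 < len(nonzero) else datetime.max
        if count == 1 and prev_ts < ts - WINDOW and next_ts > ts + WINDOW:
            rare.append(ts.strftime("%H:%M"))
    return rare


print(rare_events_neighbours(ROWS))  # ['08:46', '08:58', '09:09']
```

Because rows with a zero count are dropped first, checking only the nearest non-zero neighbour on each side is enough: if that one lies outside the 10-minute window, every other non-zero row does too.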

SQL Fiddle

Index

Since we are dealing with a large table (more than 1 million rows) and are only interested in "rare events", the most important thing for performance is a partial index like this:

 CREATE INDEX tbl_rare_idx ON tbl(ts) WHERE ct <> 0; 

If you are on Postgres 9.2 or later and certain preconditions are met, make it a covering index to allow index-only scans:

 CREATE INDEX tbl_rare_covering_idx ON tbl(ts, ct, id) WHERE ct <> 0; 

Test with EXPLAIN ANALYZE to see which query is faster and if the index is being used.


I think this is a good case for the lag() and lead() window functions. This query keeps only the rows with a non-zero count, then looks at the previous and next row to see whether either is closer than 10 minutes:

 with cte as (
   select
     "id", "timestamp", "count",
     lag("timestamp") over(w) + '10 minutes'::interval as "lag_timestamp",
     lead("timestamp") over(w) - '10 minutes'::interval as "lead_timestamp"
   from gm_inductionloopdata as curr
   where curr."count" <> 0
   window w as (order by "timestamp")
 )
 select "id", "timestamp"
 from cte
 where "count" = 1
   and ("lag_timestamp" is null or "lag_timestamp" < "timestamp")
   and ("lead_timestamp" is null or "lead_timestamp" > "timestamp")

sql fiddle demo

Or you can try this one; make sure you have an index on the timestamp column of your table:

 select *
 from gm_inductionloopdata as curr
 where curr."count" = 1
   and not exists (
     select *
     from gm_inductionloopdata as g
     where
       -- you can change this to between, I've used this just for readability
       g."timestamp" <= curr."timestamp" + '10 minutes'::interval
       and g."timestamp" >= curr."timestamp" - '10 minutes'::interval
       and g."id" <> curr."id"
       -- any other non-zero row in the window disqualifies the event
       and g."count" <> 0
   );
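Why the index matters for this variant can be illustrated with a small Python analogy: a sorted list of timestamps plays the role of the index, and each candidate row does one range probe into it instead of scanning the whole table. This is only a sketch under assumptions: `ROWS`, `WINDOW` and `rare_events_range_probe` are hypothetical names, every other non-zero row counts as a disqualifier (per the definition in the question), and a real B-tree index scan works differently in detail.

```python
from bisect import bisect_left, bisect_right
from datetime import datetime, timedelta

# Sample rows from the question as (time, count); ids are omitted.
ROWS = [
    ("08:00", 0), ("08:11", 0), ("08:15", 2), ("08:19", 0), ("08:24", 0),
    ("08:25", 0), ("08:29", 1), ("08:31", 0), ("08:35", 1), ("08:40", 0),
    ("08:46", 1), ("08:48", 0), ("08:51", 0), ("08:55", 0), ("08:58", 1),
    ("09:02", 0), ("09:09", 1),
]

WINDOW = timedelta(minutes=10)


def rare_events_range_probe(rows):
    """For each count == 1 row, binary-search a sorted list of non-zero timestamps."""
    parsed = [(datetime.strptime(t, "%H:%M"), c) for t, c in rows]
    # Stand-in for an index on the timestamp column, restricted to non-zero rows.
    nonzero_ts = sorted(ts for ts, c in parsed if c != 0)
    rare = []
    for ts, count in parsed:
        if count != 1:
            continue
        lo = bisect_left(nonzero_ts, ts - WINDOW)   # first entry >= ts - 10 min
        hi = bisect_right(nonzero_ts, ts + WINDOW)  # one past last entry <= ts + 10 min
        # The range always contains the row itself; exactly one hit means
        # no other non-zero row falls inside the window.
        if hi - lo == 1:
            rare.append(ts.strftime("%H:%M"))
    return rare


print(rare_events_range_probe(ROWS))  # ['08:46', '08:58', '09:09']
```

Each probe costs O(log n) instead of a full pass over the table, which is roughly the difference an index on the timestamp column can make for the NOT EXISTS subquery.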

sql fiddle demo

BTW, do not name your columns "count", "timestamp", or any other keyword, function name, or type name.


Source: https://habr.com/ru/post/1500310/

