Find rare events in the current window by timestamp

Question

Given the following table:

    CREATE TABLE gm_inductionloopdata (
        "id" serial NOT NULL,
        "timestamp" timestamp without time zone NOT NULL,
        "count" integer NOT NULL DEFAULT 0
    )

I am looking for "rare events". A rare event is a row that has the following properties:

Simple: count = 1
Hard: all other rows within 10 minutes of this row's timestamp (before and after) have count = 0.

Example:

    id  timestamp  count
     0  08:00      0
     1  08:11      0
     2  08:15      2      <== not rare event (count!=1)
     3  08:19      0
     4  08:24      0
     5  08:25      0
     6  08:29      1      <== not rare event (see 8:35)
     7  08:31      0
     8  08:35      1
     9  08:40      0
    10  08:46      1      <== rare event!
    10  08:48      0
    10  08:51      0
    10  08:55      0
    10  08:58      1      <== rare event!
    10  09:02      0
    10  09:09      1

Now I have the following PL/pgSQL function:

    SELECT curr.*
    FROM gm_inductionloopdata curr
    WHERE curr.count = 1
      AND (
        SELECT SUM(count)
        FROM gm_inductionloopdata
        -- BETWEEN expects the lower bound first
        WHERE timestamp BETWEEN curr.timestamp - '10 minutes'::INTERVAL
                            AND curr.timestamp + '10 minutes'::INTERVAL
      ) < 2

which is dead slow. :-(

Any suggestions on how to improve performance? I am working with more than 1 million rows here and may have to find these "rare events" regularly.

performance sql timestamp postgresql window-functions

Ronk Sep 03 '13 at 13:43

2 answers

I think this is a good case for the lag and lead window functions. This query keeps only the rows with count <> 0, and then looks at the previous and the next row to see whether either is closer than 10 minutes:

    with cte as (
        select
            "id", "timestamp", "count",
            lag("timestamp") over(w) + '10 minutes'::interval as "lag_timestamp",
            lead("timestamp") over(w) - '10 minutes'::interval as "lead_timestamp"
        from gm_inductionloopdata as curr
        where curr."count" <> 0
        window w as (order by "timestamp")
    )
    select "id", "timestamp"
    from cte
    where
        "count" = 1 and
        ("lag_timestamp" is null or "lag_timestamp" < "timestamp") and
        ("lead_timestamp" is null or "lead_timestamp" > "timestamp")

Or you can try this one; make sure you have an index on the timestamp column of your table:

    select *
    from gm_inductionloopdata as curr
    where
        curr."count" = 1 and
        not exists (
            select *
            from gm_inductionloopdata as g
            where
                -- you can change this to between, I've used this just for readability
                g."timestamp" <= curr."timestamp" + '10 minutes'::interval and
                g."timestamp" >= curr."timestamp" - '10 minutes'::interval and
                g."id" <> curr."id" and
                g."count" = 1
        );
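
As a sketch of what such an index could look like for the NOT EXISTS query: the index name below is hypothetical, and restricting the index to rows with "count" = 1 is an assumption that happens to match the filters this query uses on both the outer and the inner side. A plain index on "timestamp" alone would also work, only larger.

```sql
-- Hypothetical index name; the partial predicate mirrors the "count" = 1 filters
-- of the NOT EXISTS query, so only candidate rows are indexed.
CREATE INDEX gm_inductionloopdata_count1_ts_idx
    ON gm_inductionloopdata ("timestamp")
    WHERE "count" = 1;
```

With this in place, the inner lookup for each candidate row becomes a short range scan on "timestamp" instead of a scan over the whole table, provided the planner chooses the index.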

BTW, do not name columns "count", "timestamp", or any other keyword, function name, or type name.

Roman Pekar Sep 03 '13 at 14:04

Erwin Brandstetter · Accepted Answer · 2013-09-04T20:05:50+0000

This might be faster yet (improving on @Roman's 1st solution):

    SELECT id, ts, ct
    FROM  (
       SELECT id, ts, ct
             ,lag (ts, 1, '-infinity') OVER (ORDER BY ts) AS prev_ts
             ,lead(ts, 1, 'infinity')  OVER (ORDER BY ts) AS next_ts
       FROM   tbl
       WHERE  ct <> 0
       ) sub
    WHERE  ct = 1
    AND    prev_ts < ts - interval '10 min'
    AND    next_ts > ts + interval '10 min'
    ORDER  BY ts;

Handling the corner cases "no leading / lagging row" can be simplified considerably with these two pieces of information:
- Postgres knows the special timestamp values -infinity and infinity.
- lead() and lag() accept a default value.

Subqueries are generally faster than CTEs (some exceptions apply), because CTEs introduce optimization barriers (by design and on purpose; Postgres 12 and later can inline simple CTEs, earlier versions always materialize them). If performance matters, use CTEs only when you need them.
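
To illustrate the two points about -infinity / infinity and default values, here is a minimal sketch. The two timestamps are made-up sample values, not data from the question.

```sql
-- The special values compare as expected against any finite timestamp.
SELECT timestamp '-infinity' < timestamp '2013-09-03 08:46' AS before_everything
     , timestamp 'infinity'  > timestamp '2013-09-03 08:46' AS after_everything;

-- With a default, the first row gets -infinity as prev_ts and the last row
-- gets infinity as next_ts, instead of NULL in both places.
SELECT ts
     , lag (ts, 1, '-infinity') OVER (ORDER BY ts) AS prev_ts
     , lead(ts, 1, 'infinity')  OVER (ORDER BY ts) AS next_ts
FROM  (VALUES (timestamp '2013-09-03 08:46')
            , (timestamp '2013-09-03 08:58')) AS v(ts);
```

Because the defaults are ordinary timestamp values, the comparisons in the outer WHERE clause need no extra IS NULL branches.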

Also:

- I use proper column names instead of timestamp and count, which removes the need for double quotes. Never use reserved words, base type names, or function names as identifiers.
- None of this has anything to do with plpgsql, which is the default procedural language of Postgres.
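
To try the query against the example from the question, something like the following should work. This is only a sketch: the temporary table, the date 2013-09-03 (the question gives times only), and the generated ids are assumptions made for illustration.

```sql
-- Assumed test setup: a temporary table with the answer's column names.
CREATE TEMP TABLE tbl (
   id serial PRIMARY KEY
 , ts timestamp NOT NULL
 , ct integer NOT NULL DEFAULT 0
);

-- Example rows from the question, on an arbitrary date.
INSERT INTO tbl (ts, ct) VALUES
   ('2013-09-03 08:00', 0), ('2013-09-03 08:11', 0), ('2013-09-03 08:15', 2)
 , ('2013-09-03 08:19', 0), ('2013-09-03 08:24', 0), ('2013-09-03 08:25', 0)
 , ('2013-09-03 08:29', 1), ('2013-09-03 08:31', 0), ('2013-09-03 08:35', 1)
 , ('2013-09-03 08:40', 0), ('2013-09-03 08:46', 1), ('2013-09-03 08:48', 0)
 , ('2013-09-03 08:51', 0), ('2013-09-03 08:55', 0), ('2013-09-03 08:58', 1)
 , ('2013-09-03 09:02', 0), ('2013-09-03 09:09', 1);

-- Same logic as the query above: compare each non-zero row with its
-- nearest non-zero neighbors.
SELECT ts, ct
FROM  (
   SELECT ts, ct
        , lag (ts, 1, '-infinity') OVER (ORDER BY ts) AS prev_ts
        , lead(ts, 1, 'infinity')  OVER (ORDER BY ts) AS next_ts
   FROM   tbl
   WHERE  ct <> 0
   ) sub
WHERE  ct = 1
AND    prev_ts < ts - interval '10 min'
AND    next_ts > ts + interval '10 min'
ORDER  BY ts;
```

On this data the query returns the rows at 08:46 and 08:58, which the question marks as rare, and also the row at 09:09, which meets the stated definition because its nearest non-zero neighbor (08:58) is 11 minutes away and nothing follows it.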

Index

Since we are dealing with a big table (> 1 million rows) and are only interested in "rare events", the most important thing for performance is a partial index like this:

    CREATE INDEX tbl_rare_idx ON tbl(ts) WHERE ct <> 0;

If you are on Postgres 9.2 or later and certain preconditions are met, make that a covering index to allow index-only scans:

    CREATE INDEX tbl_rare_covering_idx ON tbl(ts, ct, id) WHERE ct <> 0;

The order of the columns is important: ts should come first, ct next, followed by any other columns you need in the SELECT list.
Read the page on index-only scans in the Postgres wiki.

Test with EXPLAIN ANALYZE to see which query is faster and if the index is being used.
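
A test run could look like the sketch below. It assumes the table tbl and one of the partial indexes from this answer already exist. Running VACUUM first is a suggestion added here, since index-only scans depend on the visibility map that VACUUM maintains.

```sql
-- Refresh planner statistics and the visibility map before measuring.
VACUUM ANALYZE tbl;

-- ANALYZE executes the query and reports actual timings;
-- BUFFERS adds the number of pages read.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, ts, ct
FROM  (
   SELECT id, ts, ct
        , lag (ts, 1, '-infinity') OVER (ORDER BY ts) AS prev_ts
        , lead(ts, 1, 'infinity')  OVER (ORDER BY ts) AS next_ts
   FROM   tbl
   WHERE  ct <> 0
   ) sub
WHERE  ct = 1
AND    prev_ts < ts - interval '10 min'
AND    next_ts > ts + interval '10 min'
ORDER  BY ts;
```

In the plan output, a node such as "Index Only Scan using tbl_rare_covering_idx" indicates that the covering index is used, and a low "Heap Fetches" count indicates that the table itself is rarely visited. Whether the planner picks the index depends on the data distribution, so compare the reported execution times of the candidate queries on the real table.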
