Method for finding gaps in time series data in MySQL?

Suppose we have a database table with two columns, entry_time and value. entry_time is a timestamp, while value can be any other data type. The records are relatively consistent, entered roughly every x minutes. However, for many x's worth of time an entry may not be made, which produces a "gap" in the data.

In terms of efficiency, what is the best way to find these gaps of at least time Y (both new and old) with a query?

+6
2 answers

To start with, let's summarize the number of entries per hour in your table.

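    -- `table` is a placeholder (and a reserved word, hence the backticks); use your own table name.
    SELECT CAST(DATE_FORMAT(entry_time, '%Y-%m-%d %k:00:00') AS DATETIME) AS hour,
           COUNT(*) AS samplecount
      FROM `table`
     GROUP BY CAST(DATE_FORMAT(entry_time, '%Y-%m-%d %k:00:00') AS DATETIME)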

Now, if you log something every six minutes (ten times an hour), all your samplecount values should be ten. This expression: CAST(DATE_FORMAT(entry_time,'%Y-%m-%d %k:00:00') AS DATETIME) looks hairy, but it simply truncates each timestamp to the hour in which it occurs by zeroing out the minutes and seconds.
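To make the truncate-and-count step concrete, here is a small Python sketch of the same logic outside the database. The function name `hourly_counts` and the sample data are made up for illustration; the sketch only assumes the timestamps are available as `datetime` objects.

```python
from collections import Counter
from datetime import datetime


def hourly_counts(entry_times):
    """Count samples per hour by truncating each timestamp to the top of its hour."""
    return Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in entry_times
    )


# Hypothetical data: one sample every six minutes from 10:00, with 10:24 missing.
samples = [datetime(2024, 1, 1, 10, m) for m in range(0, 60, 6) if m != 24]
counts = hourly_counts(samples)

for hour, n in sorted(counts.items()):
    print(hour, n)  # an hour with fewer than ten samples contains a gap
```

With ten samples expected per hour, any hour whose count is below ten is worth a closer look.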

This is reasonably efficient and will get you started. It is very efficient if you can put an index on the entry_time column and restrict your query to, say, yesterday's samples, as shown here.

    SELECT CAST(DATE_FORMAT(entry_time, '%Y-%m-%d %k:00:00') AS DATETIME) AS hour,
           COUNT(*) AS samplecount
      FROM `table`
     WHERE entry_time >= CURRENT_DATE - INTERVAL 1 DAY
       AND entry_time < CURRENT_DATE
     GROUP BY CAST(DATE_FORMAT(entry_time, '%Y-%m-%d %k:00:00') AS DATETIME)

But it isn't very good at detecting whole hours that go by with missing samples: an hour with no rows at all produces no group, so it simply doesn't show up in the result. It is also slightly sensitive to jitter in your sampling. That is, if your top-of-the-hour sample sometimes arrives half a minute early (10:59:30) and sometimes half a minute late (11:00:30), your hourly counts will be off. So this hour-by-hour summary (or day-by-day, or minute-by-minute, and so on) is not bulletproof.
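The blind spot for completely empty hours can be worked around by walking the expected hours rather than only the hours that have rows. A minimal Python sketch of that idea follows; `short_hours`, its parameters and the sample data are hypothetical, and `start` is assumed to fall exactly on an hour boundary.

```python
from collections import Counter
from datetime import datetime, timedelta


def short_hours(entry_times, start, end, expected):
    """Return (hour, count) for every hour in [start, end) with fewer than
    `expected` samples. Hours with no samples at all are reported with a
    count of 0, which a GROUP BY over the samples alone cannot do.
    """
    counts = Counter(
        t.replace(minute=0, second=0, microsecond=0) for t in entry_times
    )
    result = []
    hour = start
    while hour < end:
        if counts[hour] < expected:  # a Counter returns 0 for a missing key
            result.append((hour, counts[hour]))
        hour += timedelta(hours=1)
    return result


# Hypothetical data: full hours at 10:00 and 12:00, nothing at all during 11:00.
samples = [datetime(2024, 1, 1, h, m) for h in (10, 12) for m in range(0, 60, 6)]
missing = short_hours(
    samples, datetime(2024, 1, 1, 10), datetime(2024, 1, 1, 13), expected=10
)
print(missing)
```

This still suffers from the jitter problem described above, since it counts per hour bucket.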

To get everything exactly right you need a self-join query; it is a bit more of a hairball and not as efficient.

Let's start by creating a virtual table (a subquery) like this, with numbered samples. (This is a pain in MySQL; some other, expensive DBMSs make it easier. No matter.)

    SELECT @sample := @sample + 1 AS entry_num, c.entry_time, c.value
      FROM (
            SELECT entry_time, value
              FROM `table`
             ORDER BY entry_time
           ) c,
           (SELECT @sample := 0) s

This small virtual table gives entry_num, entry_time, value.

Next, we join it to itself.

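    -- The interval is the difference between the two entry_time values, not the two value columns.
    -- `interval` is a reserved word, so the alias needs backticks.
    SELECT one.entry_num, one.entry_time, one.value,
           TIMEDIFF(two.entry_time, one.entry_time) AS `interval`
      FROM ( /* virtual table */ ) one
      JOIN ( /* same virtual table */ ) two
        ON (two.entry_num - 1 = one.entry_num)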

This lines up two copies of the table side by side, offset by one entry, so that each sample shares a row with the sample that follows it. The offset is set by the ON clause of the JOIN.

Finally, we pick the rows from this table whose interval is larger than your threshold; those are the times of the samples right before the missing ones.

Here is the whole self-join query. I told you it was a hairball.

    SELECT one.entry_num, one.entry_time, one.value,
           TIMEDIFF(two.entry_time, one.entry_time) AS `interval`
      FROM (
            SELECT @sample := @sample + 1 AS entry_num, c.entry_time, c.value
              FROM (
                    SELECT entry_time, value FROM `table` ORDER BY entry_time
                   ) c,
                   (SELECT @sample := 0) s
           ) one
      JOIN (
            SELECT @sample2 := @sample2 + 1 AS entry_num, c.entry_time, c.value
              FROM (
                    SELECT entry_time, value FROM `table` ORDER BY entry_time
                   ) c,
                   (SELECT @sample2 := 0) s
           ) two
        ON (two.entry_num - 1 = one.entry_num)
     /* 600 seconds is only an example threshold; substitute your own */
     WHERE TIMESTAMPDIFF(SECOND, one.entry_time, two.entry_time) > 600
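For readers who find the self-join hard to follow, the same number-then-pair logic can be sketched in Python. This is only an illustration of what the query computes, not a replacement for it; `gaps_by_self_join`, the sample rows and the ten-minute threshold are all made up.

```python
from datetime import datetime, timedelta


def gaps_by_self_join(rows, threshold):
    """rows: iterable of (entry_time, value) pairs.

    Returns (entry_num, entry_time, value, interval) for each sample that is
    followed by a gap longer than `threshold`.
    """
    # Number the samples in time order, as the @sample counter does.
    ordered = sorted(rows, key=lambda r: r[0])
    numbered = {n: row for n, row in enumerate(ordered, start=1)}

    result = []
    for n, (entry_time, value) in numbered.items():
        # The matching row of the second copy: two.entry_num - 1 = one.entry_num
        nxt = numbered.get(n + 1)
        if nxt is not None and nxt[0] - entry_time > threshold:
            result.append((n, entry_time, value, nxt[0] - entry_time))
    return result


# Hypothetical samples every six minutes, with 10:18 and 10:24 missing.
rows = [
    (datetime(2024, 1, 1, 10, 0), "a"),
    (datetime(2024, 1, 1, 10, 6), "b"),
    (datetime(2024, 1, 1, 10, 12), "c"),
    (datetime(2024, 1, 1, 10, 30), "d"),
    (datetime(2024, 1, 1, 10, 36), "e"),
]
for gap in gaps_by_self_join(rows, timedelta(minutes=10)):
    print(gap)
```

The only row reported is the 10:12 sample, the last one before the gap, together with the 18-minute interval that follows it.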

If you have to do this in production on a large table, you may want to run it on a subset of your data. For example, you could run it each day over the samples from the previous two days. That would be decently efficient, and it would also make sure you don't overlook missing samples right at midnight. To do this, your little row-numbered virtual tables would look like this.

    -- WHERE has to come before ORDER BY.
    SELECT @sample := @sample + 1 AS entry_num, c.entry_time, c.value
      FROM (
            SELECT entry_time, value
              FROM `table`
             WHERE entry_time >= CURRENT_DATE - INTERVAL 2 DAY
               AND entry_time < CURRENT_DATE  /* up to yesterday, but not today */
             ORDER BY entry_time
           ) c,
           (SELECT @sample := 0) s
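Why two days rather than one? A gap that straddles midnight has its last sample in one day and its next sample in the following day, so two separate one-day windows each see only half of it. The Python sketch below illustrates this with made-up data; `window` and `gaps_in_window` are hypothetical helper names, and the half-open range mirrors the WHERE clause on entry_time.

```python
from datetime import date, datetime, time, timedelta


def window(today, days):
    """Half-open range [today - days, today), starting and ending at midnight."""
    end = datetime.combine(today, time.min)
    return end - timedelta(days=days), end


def gaps_in_window(entry_times, start, end, threshold):
    """Gaps between consecutive samples with start <= entry_time < end."""
    inside = sorted(t for t in entry_times if start <= t < end)
    return [(a, b) for a, b in zip(inside, inside[1:]) if b - a > threshold]


# Hypothetical samples with a 36-minute hole across midnight.
samples = [
    datetime(2024, 1, 1, 23, 48),
    datetime(2024, 1, 1, 23, 54),
    datetime(2024, 1, 2, 0, 30),
    datetime(2024, 1, 2, 0, 36),
]
limit = timedelta(minutes=10)

two_day = gaps_in_window(samples, *window(date(2024, 1, 3), 2), limit)
day_one = gaps_in_window(samples, *window(date(2024, 1, 2), 1), limit)
day_two = gaps_in_window(samples, *window(date(2024, 1, 3), 1), limit)

print(two_day)           # the gap from 23:54 to 00:30 is found
print(day_one, day_two)  # neither one-day window sees it
```

Running over two days each day means most gaps are reported twice, so whatever consumes the results should tolerate duplicates.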
+15

A very efficient way to do this is through a stored procedure using cursors. I think this is simpler and more efficient than the other answers.

This procedure creates a cursor and iterates over the datetime records you are checking, in order. Whenever two consecutive records are further apart than the range you specify, it writes the beginning and end of that gap to a table.

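    DELIMITER //

    CREATE PROCEDURE findgaps()
    BEGIN
      DECLARE done INT DEFAULT FALSE;
      DECLARE a, b DATETIME;
      DECLARE cur CURSOR FOR
        SELECT dateTimeCol FROM targetTable ORDER BY dateTimeCol ASC;
      DECLARE CONTINUE HANDLER FOR NOT FOUND SET done = TRUE;

      OPEN cur;
      FETCH cur INTO a;
      read_loop: LOOP
        SET b = a;            -- b holds the previous row, a receives the current one
        FETCH cur INTO a;
        IF done THEN
          LEAVE read_loop;
        END IF;
        -- DATEDIFF() counts whole days only, so TIMESTAMPDIFF() in minutes is used here
        IF TIMESTAMPDIFF(MINUTE, b, a) > [range you specify, in minutes] THEN
          -- b is the earlier timestamp, so it is the start of the gap
          INSERT INTO tmp_table (gap_begin, gap_end) VALUES (b, a);
        END IF;
      END LOOP;
      CLOSE cur;
    END //

    DELIMITER ;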

This assumes that 'tmp_table' already exists. You could easily define it as a TEMPORARY table inside the procedure, but I left that out of this example.
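The cursor loop is a single pass that remembers only the previous timestamp. As a rough illustration of that pattern, here is the same loop as a Python generator; `find_gaps`, `max_gap` and the sample data are hypothetical, and the input is assumed to be sorted already, which is what the cursor's ORDER BY provides.

```python
from datetime import datetime, timedelta


def find_gaps(entry_times, max_gap):
    """Yield (gap_begin, gap_end) for consecutive timestamps more than
    `max_gap` apart. `entry_times` must be in ascending order.
    """
    it = iter(entry_times)
    try:
        prev = next(it)  # corresponds to the first FETCH before the loop
    except StopIteration:
        return  # no rows, so no gaps
    for cur in it:
        if cur - prev > max_gap:
            yield prev, cur  # earlier timestamp first
        prev = cur


# Hypothetical timestamps with one 24-minute hole.
times = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 6),
    datetime(2024, 1, 1, 10, 30),
    datetime(2024, 1, 1, 10, 36),
]
for begin, end in find_gaps(times, timedelta(minutes=10)):
    print(begin, "->", end)
```

Because it is a generator, it works on any iterable of timestamps, including a database cursor that streams rows, without loading them all into memory.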

+1

Source: https://habr.com/ru/post/918374/

