The most efficient way to combine two time series

Question

The most efficient way to combine two time series

Imagine I have a table like this:

 CREATE TABLE time_series (
        snapshot_date DATE,
        sales INTEGER,
PRIMARY KEY (snapshot_date));

With values like this:

INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date,10 AS sales;
INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date,4 AS sales;
INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date,13 AS sales;
INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date,7 AS sales;
INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date,15 AS sales;
INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date,8 AS sales;

I would like to be able to do this:

SELECT a.snapshot_date, 
       AVG(b.sales) AS sales_avg,
       COUNT(*) AS COUNT
  FROM time_series AS a
  JOIN time_series AS b
       ON a.snapshot_date > b.snapshot_date
 GROUP BY a.snapshot_date

What gives such results:

*---------------*-----------*-------*
| snapshot_date | sales_avg | count |
*---------------*-----------*-------*
|  2017-01-02   |   10.0    |    1  |
|  2017-01-03   |   7.0     |    2  |
|  2017-01-04   |   9.0     |    3  |
|  2017-01-05   |   8.5     |    4  |
|  2017-01-06   |   9.8     |    5  |
-------------------------------------

With a trivial number of rows, as in this example, the query is very fast. The problem is that I have to do this for millions of lines, and in Redshift (similar to Postgres syntax) my request lasts several days. This is terribly slow, and yet it is one of my most common query patterns. I suspect that the problem is due to the growth of O (n ^ 2) in the data compared to the more preferred O (n).

The implementation of My O (n) in python will be something like this:

rows = [('2017-01-01',10),
        ('2017-01-02',4),
        ('2017-01-03',13),
        ('2017-01-04',7),
        ('2017-01-05',15),
        ('2017-01-06',8)]
sales_total_previous = 0
count = 0
for index, row in enumerate(rows):
    snapshot_date = row[0]
    sales = row[1]
    if index == 0:
        sales_total_previous += sales
        continue
    count += 1
    sales_avg = sales_total_previous / count
    print((snapshot_date,sales_avg, count))
    sales_total_previous += sales

With the following results (same as SQL query):

('2017-01-02', 10.0, 1)
('2017-01-03', 7.0, 2)
('2017-01-04', 9.0, 3)
('2017-01-05', 8.5, 4)
('2017-01-06', 9.8, 5)

Apache Spark, python, ( 3-4 ) Spark- 100 . O (n) SQL, Postgres/Redshift?

+4

python sql postgresql amazon-redshift

user554481 04 . '18 21:39

1

Gordon Linoff · Accepted Answer · 2018-01-04T21:45:21+0000

, :

SELECT ts.snapshot_date, 
       AVG(ts.sales) OVER (ORDER BY ts.snapshot_date) AS sales_avg,
       ROW_NUMBER() OVER (ORDER BY ts.snapshot_date) AS COUNT
FROM time_series ts;

, .

The most efficient way to combine two time series

More articles: