Imagine I have a table like this:
CREATE TABLE time_series (
snapshot_date DATE,
sales INTEGER,
PRIMARY KEY (snapshot_date));
With values like this:
INSERT INTO time_series SELECT '2017-01-01'::DATE AS snapshot_date,10 AS sales;
INSERT INTO time_series SELECT '2017-01-02'::DATE AS snapshot_date,4 AS sales;
INSERT INTO time_series SELECT '2017-01-03'::DATE AS snapshot_date,13 AS sales;
INSERT INTO time_series SELECT '2017-01-04'::DATE AS snapshot_date,7 AS sales;
INSERT INTO time_series SELECT '2017-01-05'::DATE AS snapshot_date,15 AS sales;
INSERT INTO time_series SELECT '2017-01-06'::DATE AS snapshot_date,8 AS sales;
I would like to be able to do this:
SELECT a.snapshot_date,
AVG(b.sales) AS sales_avg,
COUNT(*) AS COUNT
FROM time_series AS a
JOIN time_series AS b
ON a.snapshot_date > b.snapshot_date
GROUP BY a.snapshot_date
What gives such results:
*---------------*-----------*-------*
| snapshot_date | sales_avg | count |
*---------------*-----------*-------*
| 2017-01-02 | 10.0 | 1 |
| 2017-01-03 | 7.0 | 2 |
| 2017-01-04 | 9.0 | 3 |
| 2017-01-05 | 8.5 | 4 |
| 2017-01-06 | 9.8 | 5 |
-------------------------------------
With a trivial number of rows, as in this example, the query is very fast. The problem is that I have to do this for millions of lines, and in Redshift (similar to Postgres syntax) my request lasts several days. This is terribly slow, and yet it is one of my most common query patterns. I suspect that the problem is due to the growth of O (n ^ 2) in the data compared to the more preferred O (n).
The implementation of My O (n) in python will be something like this:
rows = [('2017-01-01',10),
('2017-01-02',4),
('2017-01-03',13),
('2017-01-04',7),
('2017-01-05',15),
('2017-01-06',8)]
sales_total_previous = 0
count = 0
for index, row in enumerate(rows):
snapshot_date = row[0]
sales = row[1]
if index == 0:
sales_total_previous += sales
continue
count += 1
sales_avg = sales_total_previous / count
print((snapshot_date,sales_avg, count))
sales_total_previous += sales
With the following results (same as SQL query):
('2017-01-02', 10.0, 1)
('2017-01-03', 7.0, 2)
('2017-01-04', 9.0, 3)
('2017-01-05', 8.5, 4)
('2017-01-06', 9.8, 5)
Apache Spark, python, ( 3-4 ) Spark- 100 . O (n) SQL, Postgres/Redshift?