How to generate multiple time series in one SQL query?

Here is the database layout. I have a table of sparse sales over time, aggregated per day. If an item has 10 sales on 2015-01-01, there is a record for it, but if it has 0 sales, there is no record at all. Something like this:

    |--------------------------------------|
    | day_of_year | year | sales | item_id |
    |--------------------------------------|
    | 01          | 2015 | 20    | A1      |
    | 01          | 2015 | 11    | A2      |
    | 07          | 2015 | 09    | A1      |
    | ...         | ...  | ...   | ...     |
    |--------------------------------------|
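For reference, a table matching this layout with the sample rows can be created as follows (the column types here are an assumption):

    -- Hypothetical DDL matching the layout above; types are assumed.
    CREATE TABLE entry_daily (
        day_of_year int,
        year        int,
        sales       int,
        item_id     text
    );

    INSERT INTO entry_daily (day_of_year, year, sales, item_id) VALUES
        (1, 2015, 20, 'A1'),
        (1, 2015, 11, 'A2'),
        (7, 2015,  9, 'A1');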

This is how I get the time series for one item (the UNION with generate_series fills in the missing days with 0, and max() keeps the real figure on days where both rows exist):

    SELECT doy, max(sales)
    FROM (
        SELECT day_of_year AS doy, sales AS sales
        FROM myschema.entry_daily
        WHERE item_id = 'theNameOfMyItem'
          AND year = 2015
          AND day_of_year < 150
        UNION
        SELECT doy AS doy, 0 AS sales
        FROM generate_series(1, 149) AS doy
    ) AS t
    GROUP BY doy
    ORDER BY doy;

I am currently looping in R, issuing one query per item, and then aggregating the results into a data frame. But this is very slow. I would like a single query that returns all the data in the following form:

    |-----------------------------------------------|
    | item_id | 01 | 02 | 03 | 04 | 05 | ... | 149  |
    |-----------------------------------------------|
    | A1      | 10 | 00 | 00 | 05 | 12 | ... | 11   |
    | A2      | 11 | 00 | 30 | 01 | 15 | ... | 09   |
    | A3      | 20 | 00 | 00 | 05 | 17 | ... | 20   |
    | ...                                           |
    |-----------------------------------------------|

Is it possible? By the way, I am using a Postgres database.

3 answers

Solution 1. A simple query with an aggregate.

This is the simplest and fastest way to get the expected result. The sales column is easy to parse in the client program.

    select item, string_agg(coalesce(sales, 0)::text, ',') sales
    from (
        select distinct item_id item, doy
        from generate_series (1, 10) doy  -- change 10 to the given n
        cross join entry_daily
    ) sub
    left join entry_daily on item_id = item and day_of_year = doy
    group by 1
    order by 1;

     item |        sales
    ------+----------------------
     A1   | 20,0,0,0,0,0,9,0,0,0
     A2   | 11,0,0,0,0,0,0,0,0,0
    (2 rows)

Solution 2. Dynamically created view.

Based on solution 1, but with array_agg() instead of string_agg(). The function creates a view with the given number of columns.

    create or replace function create_items_view(view_name text, days int)
    returns void language plpgsql as $$
    declare
        list text;
    begin
        select string_agg(format('s[%s] "%s"', i::text, i::text), ',')
        into list
        from generate_series(1, days) i;

        execute(format($f$
            drop view if exists %s;
            create view %s as
                select item, %s
                from (
                    select item, array_agg(coalesce(sales, 0)) s
                    from (
                        select distinct item_id item, doy
                        from generate_series (1, %s) doy
                        cross join entry_daily
                    ) sub
                    left join entry_daily on item_id = item and day_of_year = doy
                    group by 1
                    order by 1
                ) q
            $f$, view_name, view_name, list, days)
        );
    end $$;

Usage:

    select create_items_view('items_view_10', 10);
    select * from items_view_10;

     item | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
    ------+----+---+---+---+---+---+---+---+---+----
     A1   | 20 | 0 | 0 | 0 | 0 | 0 | 9 | 0 | 0 |  0
     A2   | 11 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |  0
    (2 rows)

Solution 3. Crosstab.

Easy to use, but inconvenient with a large number of columns, since the output row format has to be spelled out in the column definition list.

    create extension if not exists tablefunc;

    select *
    from crosstab (
        'select item_id, day_of_year, sales from entry_daily order by 1',
        'select i from generate_series (1, 10) i'
    ) as ct (item_id text, "1" int, "2" int, "3" int, "4" int, "5" int,
             "6" int, "7" int, "8" int, "9" int, "10" int);

     item_id | 1  | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10
    ---------+----+---+---+---+---+---+---+---+---+----
     A1      | 20 |   |   |   |   |   | 9 |   |   |
     A2      | 11 |   |   |   |   |   |   |   |   |
    (2 rows)
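Note that, unlike solutions 1 and 2, crosstab leaves days without sales as NULL rather than 0, as the blanks in the output above show. If zeros are required, a minimal sketch that wraps the same call in coalesce():

    -- Zero-fill the crosstab output; same source queries as above.
    select item_id,
           coalesce("1", 0) as "1", coalesce("2", 0) as "2",
           coalesce("3", 0) as "3", coalesce("4", 0) as "4",
           coalesce("5", 0) as "5", coalesce("6", 0) as "6",
           coalesce("7", 0) as "7", coalesce("8", 0) as "8",
           coalesce("9", 0) as "9", coalesce("10", 0) as "10"
    from crosstab (
        'select item_id, day_of_year, sales from entry_daily order by 1',
        'select i from generate_series (1, 10) i'
    ) as ct (item_id text, "1" int, "2" int, "3" int, "4" int, "5" int,
             "6" int, "7" int, "8" int, "9" int, "10" int);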

A different approach: first you need a calendar table with all dates, to fill in the missing days. A hundred years of dates is only about 36,500 rows, so it is not very big, and it saves recomputing the series on every query.

allDates:

    date_id
    s_date

or, with calculated fields:

    date_id
    s_date
    doy  = EXTRACT(DOY FROM s_date)
    year = EXTRACT(YEAR FROM s_date)
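A minimal sketch of creating and filling such a table in one pass with generate_series (the table name, date range, and date_id scheme are assumptions):

    -- Assumed names and range; adjust as needed.
    CREATE TABLE alldates AS
    SELECT row_number() OVER (ORDER BY d) AS date_id,
           d::date                        AS s_date,
           EXTRACT(DOY  FROM d)::int      AS doy,
           EXTRACT(YEAR FROM d)::int      AS year
    FROM generate_series('2000-01-01'::timestamp,
                         '2099-12-31'::timestamp,
                         interval '1 day') AS g(d);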

Your base query will then be as follows (SQL FIDDLE DEMO):

    SELECT AD.year, AD.doy, allitems.item_id,
           COALESCE(SUM(ED.sales), 0) as max_sales
    FROM (SELECT DISTINCT item_id
          FROM entry_daily) as allitems
    CROSS JOIN alldates AD
    LEFT JOIN entry_daily ED
           ON ED.day_of_year = AD.doy
          AND ED.year = AD.year
          AND ED.item_id = allitems.item_id
    WHERE AD.year = 2015
    GROUP BY AD.year, AD.doy, allitems.item_id
    ORDER BY AD.year, AD.doy, allitems.item_id

You will get this output:

    | year | doy | item_id | max_sales |
    |------|-----|---------|-----------|
    | 2015 |   1 | A1      |        20 |
    | 2015 |   1 | A2      |        11 |
    | 2015 |   2 | A1      |         0 |
    | 2015 |   2 | A2      |         0 |
    | 2015 |   3 | A1      |         0 |
    | 2015 |   3 | A2      |         0 |
    | 2015 |   4 | A1      |         0 |
    | 2015 |   4 | A2      |         0 |
    | 2015 |   5 | A1      |         0 |
    | 2015 |   5 | A2      |         0 |
    | 2015 |   6 | A1      |         0 |
    | 2015 |   6 | A2      |         0 |
    | 2015 |   7 | A1      |        39 |
    | 2015 |   7 | A2      |         0 |
    | 2015 |   8 | A1      |         0 |
    | 2015 |   8 | A2      |         0 |
    | 2015 |   9 | A1      |         0 |
    | 2015 |   9 | A2      |         0 |
    | 2015 |  10 | A1      |         0 |
    | 2015 |  10 | A2      |         0 |

Then you need to install the tablefunc extension and use crosstab to pivot this result, as sketched below.
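A possible final step, assuming the base query above has been saved as a view named sales_per_day (the view name is an assumption):

    -- Hypothetical pivot of the base query above, saved as view sales_per_day
    -- with columns (year, doy, item_id, max_sales).
    CREATE EXTENSION IF NOT EXISTS tablefunc;

    SELECT *
    FROM crosstab(
        'SELECT item_id, doy, max_sales
         FROM sales_per_day
         WHERE year = 2015
         ORDER BY 1, 2',
        'SELECT i FROM generate_series(1, 10) i'
    ) AS ct (item_id text, "1" int, "2" int, "3" int, "4" int, "5" int,
             "6" int, "7" int, "8" int, "9" int, "10" int);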


Try this self-contained code, in which we have used 5 instead of 149 to keep the output short.

In (1), we use a single SQL statement to generate all the series at once, producing the result in long form. Relational databases typically work with long form rather than wide form, and you may prefer it as-is; in case you do not, we follow it with a conversion to wide form using the reshape2 package.

In (2), we show how to replace an SQL statement with R code that uses the dplyr package.

1) PostgreSQL. In the SQL statement below, the innermost select generates the table 1, 2, ..., 5 with a column named day_of_year. This is cross joined with entry_daily to form every combination of day_of_year, year, and item, keeping only the distinct rows. That result is then left joined to entry_daily to pick up the sales figures, which we sum.

Assuming you have configured PostgreSQL to work with sqldf, as described in FAQ #12 on the sqldf home page (https://github.com/ggrothendieck/sqldf), the following is self-contained code that you can simply copy and paste into your R session.

    library(sqldf)
    library(RPostgreSQL)

    # input data
    entry_daily <- structure(list(
        day_of_year = c(1L, 1L, 7L),
        year = c(2015L, 2015L, 2015L),
        sales = c(20L, 11L, 9L),
        item_id = structure(c(1L, 2L, 1L), .Label = c("A1", "A2"),
                            class = "factor")),
        .Names = c("day_of_year", "year", "sales", "item_id"),
        class = "data.frame", row.names = c(NA, -3L))

    s <- sqldf("select A.item_id, A.year, A.day_of_year,
                       sum(coalesce(B.sales, 0)) sales
                from (select distinct x.day_of_year, y.year, y.item_id
                      from (select * from generate_series(1, 5) as day_of_year) as x
                      cross join entry_daily as y) as A
                left join entry_daily as B
                       on A.year = B.year and
                          A.day_of_year = B.day_of_year and
                          A.item_id = B.item_id
                where A.year = 2015
                group by A.item_id, A.year, A.day_of_year
                order by A.item_id, A.year, A.day_of_year")

The output of the above query is this data frame:

    > s
       item_id year day_of_year sales
    1       A1 2015           1    20
    2       A1 2015           2     0
    3       A1 2015           3     0
    4       A1 2015           4     0
    5       A1 2015           5     0
    6       A2 2015           1    11
    7       A2 2015           2     0
    8       A2 2015           3     0
    9       A2 2015           4     0
    10      A2 2015           5     0

If you really need it in wide form, we can do that in R using dcast from the reshape2 package:

    library(reshape2)
    dcast(s, item_id + year ~ day_of_year, value.var = "sales")

giving:

      item_id year  1 2 3 4 5
    1      A1 2015 20 0 0 0 0
    2      A2 2015 11 0 0 0 0

2) dplyr. Note that, as an alternative to the SQL statement, this R code will compute the same result, s2:

    library(dplyr)

    s2 <- expand.grid(item_id = unique(entry_daily$item_id),
                      year = 2015,
                      day_of_year = 1:5) %>%
          left_join(entry_daily) %>%
          group_by(item_id, year, day_of_year) %>%
          summarize(sales = sum(sales, na.rm = TRUE)) %>%
          ungroup() %>%
          arrange(item_id, year, day_of_year)

giving:

    > s2
    Joining by: c("item_id", "year", "day_of_year")
    Source: local data frame [10 x 4]
    Groups: item_id, year [?]

       item_id  year day_of_year sales
        (fctr) (dbl)       (int) (int)
    1       A1  2015           1    20
    2       A1  2015           2     0
    3       A1  2015           3     0
    4       A1  2015           4     0
    5       A1  2015           5     0
    6       A2  2015           1    11
    7       A2  2015           2     0
    8       A2  2015           3     0
    9       A2  2015           4     0
    10      A2  2015           5     0

Now, if desired, use the same dcast as in (1).

