Average inventory table

I have a table that tracks inventory changes over time for some stores and products. The value is the absolute stock level, but a new row is only inserted when the stock changes. This design was meant to keep the table small, since it is expected to grow quickly.

Here is the table definition and some test data:

CREATE TABLE stocks (
    id         serial  NOT NULL,
    store_id   integer NOT NULL,
    product_id integer NOT NULL,
    date       date    NOT NULL,
    value      integer NOT NULL,
    CONSTRAINT stocks_pkey PRIMARY KEY (id),
    CONSTRAINT stocks_store_id_product_id_date_key UNIQUE (store_id, product_id, date)
);

INSERT INTO stocks(store_id, product_id, date, value) VALUES
  (1, 10, '2013-01-05',  4),
  (1, 10, '2013-01-09',  7),
  (1, 10, '2013-01-11',  5),
  (1, 11, '2013-01-05',  8),
  (2, 10, '2013-01-04', 12),
  (2, 11, '2012-12-04', 23);

I need to determine the average stock level between a start and an end date for each product and store, but my problem is that a simple avg() does not take into account that the value remains unchanged between changes.
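For illustration (this example is mine, not part of the original question), a plain avg() over the sample data weights every change equally, no matter how long each value was in effect:

SELECT store_id, product_id, round(avg(value), 2) AS naive_avg
FROM   stocks
WHERE  date BETWEEN '2013-01-01' AND '2013-01-15'
GROUP  BY store_id, product_id;
-- (1,10) yields (4+7+5)/3 = 5.33 instead of the time-weighted 3.67,
-- and (2,11) drops out entirely because its only row predates the range.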

What I would like is something like this:

SELECT s.store_id, s.product_id, special_avg(s.value)
FROM   stocks s
WHERE  s.date BETWEEN '2013-01-01' AND '2013-01-15'
GROUP  BY s.store_id, s.product_id;

resulting in something like this:

 store_id | product_id | avg
----------+------------+--------------
        1 |         10 | 3.6666666667
        1 |         11 | 5.8666666667
        2 |         10 | 9.6
        2 |         11 | 23

To use the standard avg() function, I would need to "carry forward" the previous value for each store_id and product_id through time until the next change occurs. Any ideas on how to achieve this?

sql postgresql average window-functions date-range
Aug 11 '14 at 16:22
3 answers

The special difficulty of this task: you cannot just pick the data points inside your time range; you must additionally consider the latest data point before the time range and the earliest data point after it. This varies for every row, and each of those data points may or may not exist. That calls for a sophisticated query and makes it hard to use indexes.
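For instance, just fetching the latest data point before the range is a query of its own (a sketch against the question's stocks table):

SELECT DISTINCT ON (store_id, product_id)
       store_id, product_id, date, value
FROM   stocks
WHERE  date < '2013-01-01'
ORDER  BY store_id, product_id, date DESC;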

You can use range types and operators (Postgres 9.2+) to simplify the calculation:

WITH input(a, b) AS (
   SELECT '2013-01-01'::date  -- your time frame here
        , '2013-01-15'::date) -- inclusive borders
SELECT store_id, product_id
     , sum(upper(days) - lower(days)) AS days_in_range
     , round(sum(value * (upper(days) - lower(days)))::numeric
           / (SELECT b - a + 1 FROM input), 2) AS your_result
     , round(sum(value * (upper(days) - lower(days)))::numeric
           / sum(upper(days) - lower(days)), 2) AS my_result
FROM (
   SELECT store_id, product_id, value, s.day_range * x.day_range AS days
   FROM (
      SELECT store_id, product_id, value
           , daterange(day, lead(day, 1, now()::date)
                            OVER (PARTITION BY store_id, product_id
                                  ORDER BY day)) AS day_range
      FROM   stock
      ) s
   JOIN (
      SELECT daterange(a, b + 1) AS day_range
      FROM   input
      ) x ON s.day_range && x.day_range
   ) sub
GROUP  BY 1, 2
ORDER  BY 1, 2;
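Against the sample data (loaded into the renamed stock table with the day column), this returns the following by my own calculation:

 store_id | product_id | days_in_range | your_result | my_result
----------+------------+---------------+-------------+-----------
        1 |         10 |            11 |        3.67 |      5.00
        1 |         11 |            11 |        5.87 |      8.00
        2 |         10 |            12 |        9.60 |     12.00
        2 |         11 |            15 |       23.00 |     23.00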

Note: I use the column name day instead of date. I never use base type names as column names.

In subquery sub, I pick the day of the next row for each item with the lead() window function, using its built-in option to supply "today" as the default where there is no next row.
From this I build a daterange and match it against the input with the overlap operator &&, then compute the resulting date range with the intersection operator *.
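A minimal standalone illustration of those two operators (dates assumed for the example):

SELECT daterange('2013-01-09', '2013-01-11') && daterange('2013-01-01', '2013-01-16') AS overlaps      -- true
     , daterange('2012-12-04', '2013-01-04') *  daterange('2013-01-01', '2013-01-16') AS intersection; -- [2013-01-01,2013-01-04)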

All ranges are built with an exclusive upper bound. Therefore, I add one day to the input range. This way we can simply subtract lower(range) from upper(range) to get the number of days.
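To make that concrete (a self-contained sketch):

SELECT daterange('2013-01-05', '2013-01-09') AS range  -- [2013-01-05,2013-01-09), upper bound exclusive
     , upper(daterange('2013-01-05', '2013-01-09'))
     - lower(daterange('2013-01-05', '2013-01-09')) AS days;  -- 4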

I assume that "yesterday" is the last day with reliable data; today can still change in real life. Consequently, I use today (now()::date) as the exclusive upper bound for open-ended ranges.

I provide two results:

  • your_result agrees with your displayed results.
    You divide by the number of days in your date range unconditionally. For example, if an item is listed only for the last day, you get a very low (misleading!) "average".

  • my_result computes the same or higher numbers.
    I divide by the actual number of days the item was listed. For example, if an item is listed only for the last day, I return its value as the average.

To illustrate the difference, I added the count of days each item was listed: days_in_range.
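A worked check of that difference against the sample data (my arithmetic, not from the original answer): store 2 / product 10 has a single row (value 12, dated 2013-01-04), which covers 12 of the 15 days in the range.

SELECT round(12 * 12 / 15.0, 2) AS your_result  -- 9.60: divides by all 15 days
     , round(12 * 12 / 12.0, 2) AS my_result;   -- 12.00: divides by the 12 covered days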

SQL Fiddle

Index and Performance

For this kind of data, old rows typically never change. That makes it an excellent case for a materialized view:

CREATE MATERIALIZED VIEW mv_stock AS
SELECT store_id, product_id, value
     , daterange(day, lead(day, 1, now()::date)
                      OVER (PARTITION BY store_id, product_id
                            ORDER BY day)) AS day_range
FROM   stock;
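Note that the view bakes in now()::date as the upper bound of the open-ended ranges, so it goes stale as days pass; a scheduled refresh keeps it current:

REFRESH MATERIALIZED VIEW mv_stock;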

Then you can add a GiST index that supports the relevant && operator:

 CREATE INDEX mv_stock_range_idx ON mv_stock USING gist (day_range); 
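A range query of this shape can then use the index (a sketch; the input range is an assumed example):

SELECT store_id, product_id, value, day_range
FROM   mv_stock
WHERE  day_range && daterange('2013-01-01', '2013-01-16');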

Big test case

I ran a more realistic test with 200k rows. The query using the MV was about 6 times faster, which in turn was ~10 times faster than @Joop's query. Performance depends heavily on data distribution. The MV helps most with big tables and frequently repeated queries. Also, if the table has columns that are not relevant to this query, the MV can be smaller. It is a question of cost and benefit.

I put all the solutions posted so far (and adapted them) into one big fiddle to play with:

SQL Fiddle with the big test case.
SQL Fiddle with just 40k rows - to avoid a timeout on sqlfiddle.com

Aug 11 '14

This one is quick and dirty: instead of doing the unpleasant interval arithmetic, just join against a calendar table and sum everything up.

WITH calendar(zdate) AS (
    SELECT generate_series('2013-01-01'::date
                         , '2013-01-15'::date
                         , '1 day'::interval)::date
    )
SELECT st.store_id, st.product_id
     , SUM(st.zvalue) AS sval
     , COUNT(*)       AS nval
     , (SUM(st.zvalue)::decimal(8,2) / COUNT(*))::decimal(8,2) AS wval
FROM calendar
JOIN stocks st ON calendar.zdate >= st.zdate  -- note: assumes the date/value
                                              -- columns are named zdate/zvalue
AND NOT EXISTS (
    -- this calendar entry belongs to the next stocks entry
    SELECT *
    FROM   stocks nx
    WHERE  nx.store_id   = st.store_id
      AND  nx.product_id = st.product_id
      AND  nx.zdate >  st.zdate
      AND  nx.zdate <= calendar.zdate
    )
GROUP BY st.store_id, st.product_id
ORDER BY st.store_id, st.product_id;
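To see what the join fans out against, the CTE alone simply yields one row per calendar day (a shortened range for illustration):

SELECT generate_series('2013-01-05'::date, '2013-01-08'::date, '1 day'::interval)::date AS zdate;
-- 2013-01-05, 2013-01-06, 2013-01-07, 2013-01-08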
Aug 12 '14 at 11:54

This answer is built on the assumption that you are looking for an average over a span of days, so that each day counts as a new row. While this could be handled row by row in other SQL engines, it was easier here to split the average into sum(value) / count(value) and extrapolate each value over the number of days it was in effect. Using your table format and this goal, I came up with the following solution (SQLFiddle):

SELECT store_id, product_id
     , CASE WHEN sum(nextdate - date) > 0
            THEN sum(value * (nextdate - date)) / sum(nextdate - date)
       END AS avg_value
FROM (
    SELECT *
         , (SELECT value
            FROM   stocks b
            WHERE  a.store_id = b.store_id
              AND  a.product_id = b.product_id
              AND  a.date >= b.date
            ORDER  BY b.date
            LIMIT  1) * 1.0 "value"
         , coalesce((SELECT date
                     FROM   stocks b
                     WHERE  a.store_id = b.store_id
                       AND  a.product_id = b.product_id
                       AND  a.date < b.date
                     ORDER  BY b.date
                     LIMIT  1)
                  , CASE WHEN current_date > '2013-01-12' THEN '2013-01-12'
                         ELSE current_date END) nextdate
    FROM (
        SELECT store_id, product_id
             , min(CASE WHEN date < '2013-01-07' THEN '2013-01-07' ELSE date END) date
        FROM   stocks z
        WHERE  date < '2013-01-12'
        GROUP  BY store_id, product_id
        ) a
    UNION ALL
    SELECT store_id, product_id, date, value * 1.0 "value"
         , coalesce((SELECT date
                     FROM   stocks b
                     WHERE  a.store_id = b.store_id
                       AND  a.product_id = b.product_id
                       AND  a.date < b.date
                     ORDER  BY b.date
                     LIMIT  1)
                  , CASE WHEN current_date > '2013-01-12' THEN '2013-01-12'
                         ELSE current_date END) nextdate
    FROM   stocks a
    WHERE  a.date BETWEEN '2013-01-07' AND '2013-01-12'
    ) t
GROUP BY store_id, product_id;

The query takes the first occurrence of each store/product before the start parameter ('2013-01-07'), swapping in the parameter as the date when it is later than the table entry's date; it picks the value of that earliest record and the date of the first change in the table after the start parameter, capping that next date at the end parameter ('2013-01-12'). The second part of the union captures all changes between the two parameters, along with the next change date or the current date, both again capped at the end parameter. Finally, the values are multiplied by the difference in dates, summed, and divided by the total number of days between the dates. Since all dates are capped inside the query, the result is the average over the exact window passed as parameters.

I am not up on every PostgreSQL idiom, but if you plan to implement this in a function, copying this query and replacing all instances of '2013-01-07' with the name of your start parameter and all instances of '2013-01-12' with the name of your end parameter will give you the results you are looking for, for any date window.
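A hypothetical function shell (my sketch; the function and parameter names are invented) to show where the parameters would go:

CREATE OR REPLACE FUNCTION avg_in_window(_start date, _end date)
  RETURNS TABLE (store_id integer, product_id integer, avg_value numeric)
  LANGUAGE sql STABLE AS
$func$
   -- paste the full query here, replacing every '2013-01-07' with _start
   -- and every '2013-01-12' with _end; qualify columns (s.store_id, ...)
   -- to avoid clashes with the OUT parameters
   SELECT s.store_id, s.product_id, avg(s.value)::numeric  -- placeholder body
   FROM   stocks s
   WHERE  s.date BETWEEN _start AND _end
   GROUP  BY s.store_id, s.product_id
$func$;

-- usage: SELECT * FROM avg_in_window('2013-01-07', '2013-01-12');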

Edit: if you want to average over a different unit of time, simply replace the two nextdate - date instances with whatever time calculation you need; nextdate - date returns the number of days between them.

Aug 11 '14 at 18:03


