Speeding up aggregates is usually done by maintaining additional tables.

Assuming a table detail(id, dimA, dimB, dimC, value), if you want the performance of AVG (or other aggregate functions) to stay practically constant regardless of the number of records, you could introduce a new table
dimAavg(dimA, avgValue)
- The size of this table depends only on the number of distinct dimA values. (Besides, such a table may make sense in your design anyway, since it can hold the domain of values available for dimA in detail, along with other attributes related to that dimension value.) A minimal sketch of this table follows the list.
- This table is useful only if you aggregate by dimA alone; as soon as you need AVG(value) by dimA and dimB, it becomes useless. So you need to know up front which attributes you want to analyze quickly. The number of rows required to store aggregates over several attributes is n(dimA) × n(dimB) × n(dimC) × ..., which may or may not grow quickly.
- Maintaining this table adds cost to updates (including inserts and deletes), but there are further optimizations you can use ...
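To make the first point concrete, here is a minimal, dialect-neutral sketch of such a table (the column types are assumptions):

    CREATE TABLE dimAavg (
        dimA     INT PRIMARY KEY,   -- one row per distinct dimA value
        avgValue DECIMAL(18, 6)     -- precomputed AVG(value) for that dimA
    );

    -- A full (re)build is a single aggregate pass over detail:
    INSERT INTO dimAavg (dimA, avgValue)
    SELECT dimA, AVG(value)
    FROM detail
    GROUP BY dimA;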
For example, suppose a system predominantly inserts and only occasionally updates and deletes.
Suppose further that you only want to analyze by dimA and that id is ever-increasing. Then, having a structure such as
dimA_agg(dimA, Total, Count, LastID)
can help without significant impact on the system.
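For concreteness, a sketch of that structure (the column types are assumptions):

    CREATE TABLE dimA_agg (
        dimA   INT PRIMARY KEY,
        Total  DECIMAL(18, 6) NOT NULL DEFAULT 0,  -- running SUM(value) up to LastID
        Count  BIGINT         NOT NULL DEFAULT 0,  -- running COUNT(*) up to LastID
        LastID BIGINT         NOT NULL DEFAULT 0   -- highest detail.id folded in so far
    );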
This is because you can have triggers that update the aggregate not on every insert, but only after every 100 inserts or so.
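As a sketch of what such a trigger could look like (PostgreSQL PL/pgSQL is an assumption here, the function and trigger names are made up, and concurrency control is omitted):

    CREATE OR REPLACE FUNCTION detail_fold_agg() RETURNS trigger AS $$
    DECLARE
        last_id BIGINT;
        backlog BIGINT;
    BEGIN
        SELECT LastID INTO last_id FROM dimA_agg WHERE dimA = NEW.dimA;
        IF last_id IS NULL THEN
            -- first row for this dimA value: start an empty aggregate
            INSERT INTO dimA_agg (dimA, Total, Count, LastID)
            VALUES (NEW.dimA, 0, 0, 0);
            last_id := 0;
        END IF;

        -- how many rows have arrived since the last fold?
        SELECT COUNT(*) INTO backlog
        FROM detail WHERE dimA = NEW.dimA AND id > last_id;

        IF backlog >= 100 THEN  -- the threshold from the text; tune it
            UPDATE dimA_agg a
            SET Total  = a.Total + s.sum_v,
                Count  = a.Count + s.cnt,
                LastID = s.max_id
            FROM (SELECT SUM(value) AS sum_v, COUNT(*) AS cnt, MAX(id) AS max_id
                  FROM detail WHERE dimA = NEW.dimA AND id > last_id) s
            WHERE a.dimA = NEW.dimA;
        END IF;
        RETURN NULL;  -- return value is ignored for AFTER triggers
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER detail_agg_maint
    AFTER INSERT ON detail
    FOR EACH ROW EXECUTE FUNCTION detail_fold_agg();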
This way you can get exact aggregates by combining this table with the detail table:
    SELECT a.dimA,
           (COALESCE(SUM(d.value), 0) + MAX(a.Total))
               / (COUNT(d.id) + MAX(a.Count)) AS avgDimA
    FROM dimA_agg a
    LEFT JOIN detail d ON d.dimA = a.dimA AND d.id > a.LastID
    GROUP BY a.dimA

(The LEFT JOIN keeps dimA values that happen to have no rows newer than LastID.)
With the correct indexes, the query above reads one row from dimA_agg and fewer than 100 rows from detail per dimA value, so it runs in almost constant time (~ log_fanout(n)), and dimA_agg does not need to be updated on every insert (reducing the update penalty).
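The "correct indexes" boil down to a composite index on the detail side (a sketch; the index name is made up):

    -- Lets the join probe detail by dimA and then range-scan id > LastID:
    CREATE INDEX ix_detail_dimA_id ON detail (dimA, id);
    -- dimA_agg.dimA is the primary key, so that side is a single-row lookup.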
The value 100 was just an example; you have to find the optimal threshold yourself (or even keep it variable, although in that case triggers alone will no longer be enough).
Maintenance for deletes and updates does have to run on every operation, but you can still check whether the id of the deleted or updated record is already covered by the statistics (id <= LastID) and skip the aggregate update when it is not, saving some I/O.
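Under the same PostgreSQL assumptions, a sketch of that check; the LastID comparison makes the statement a no-op when the affected row has not been folded in yet:

    CREATE OR REPLACE FUNCTION detail_del_upd_agg() RETURNS trigger AS $$
    BEGIN
        IF TG_OP = 'DELETE' THEN
            UPDATE dimA_agg
            SET Total = Total - OLD.value, Count = Count - 1
            WHERE dimA = OLD.dimA AND LastID >= OLD.id;  -- skip if not folded yet
        ELSE
            -- plain value change; moving a row to another dimA would need
            -- both aggregate rows adjusted (omitted in this sketch)
            UPDATE dimA_agg
            SET Total = Total + NEW.value - OLD.value
            WHERE dimA = OLD.dimA AND LastID >= OLD.id;
        END IF;
        RETURN NULL;
    END;
    $$ LANGUAGE plpgsql;

    CREATE TRIGGER detail_agg_del_upd
    AFTER DELETE OR UPDATE OF value ON detail
    FOR EACH ROW EXECUTE FUNCTION detail_del_upd_agg();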
Note: the analysis above assumes a domain with discrete attributes; with time series things get more complicated, because you have to choose the granularity at which to store the summaries.
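For instance, assuming a hypothetical ts timestamp column in place of the discrete dimensions, you might settle on daily granularity:

    -- Hypothetical: detail carries a ts TIMESTAMP column; day is the chosen grain.
    CREATE TABLE daily_agg (
        day   DATE           PRIMARY KEY,
        Total DECIMAL(18, 6) NOT NULL,
        Count BIGINT         NOT NULL
    );

    INSERT INTO daily_agg (day, Total, Count)
    SELECT CAST(ts AS DATE), SUM(value), COUNT(*)
    FROM detail
    GROUP BY CAST(ts AS DATE);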
EDIT
There are also materialized views, 2, 3.
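For example, in PostgreSQL (the syntax and refresh semantics are engine-specific; the view name is made up):

    CREATE MATERIALIZED VIEW dimA_avg_mv AS
    SELECT dimA, AVG(value) AS avgValue
    FROM detail
    GROUP BY dimA;

    -- The view is a snapshot; refresh it on whatever schedule suits you:
    REFRESH MATERIALIZED VIEW dimA_avg_mv;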