How to improve the performance of AVG() calculations in SQL?

I'm having performance issues: an SQL query that calculates the average value of a column gradually becomes slower as the number of records grows. Is there an index type I can add to the column that will speed up average calculations?

The database is PostgreSQL, and I know that a particular type of index may not be available there, but I am also interested in a theoretical answer: whether this is possible at all without any caching solution.

To be more specific, the data in question is essentially a journal with the following definition:

table log {
    int duration
    date time
    string event
}

I make requests like

SELECT AVG(duration) FROM log WHERE event = 'finished';                       -- average time to completion
SELECT AVG(duration) FROM log WHERE event = 'finished' AND date > $yesterday; -- average since yesterday

The second is always fast enough, since it has a more restrictive WHERE clause, but the overall average duration is the kind of query that causes the problem. I understand that I can cache values using OLAP or something like that; my question is whether I can do this entirely through database-side optimizations, such as indexes.

+4
source
5 answers

Calculating an average will always get slower the more records you have, because the calculation has to read the value from every record.

An index can still help if the index contains less data than the table itself. Creating an index just on the field being averaged is not useful by itself, since you don't want to search for rows; you want to read all the values as efficiently as possible. Usually you add the averaged field as an extra (output) column to an index that is already used by the query.
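One common way to make an index "contain the data" is a covering index. The sketch below is illustrative, not from the original question: the table contents, the index name, and the use of SQLite instead of PostgreSQL are all my assumptions for a self-contained demo. In PostgreSQL 11+ the same idea is spelled with an INCLUDE clause, e.g. `CREATE INDEX log_event_idx ON log (event) INCLUDE (duration);`.

```python
import sqlite3

# A composite index on (event, duration) lets the engine answer
# AVG(duration) for one event from the index alone, never touching
# the table. SQLite calls this a "covering index" in its query plan.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE log (duration INTEGER, date TEXT, event TEXT)")
con.executemany(
    "INSERT INTO log VALUES (?, ?, ?)",
    [(i, "2024-01-01", "finished" if i % 2 else "started") for i in range(1, 1001)],
)
con.execute("CREATE INDEX log_event_duration ON log (event, duration)")

plan = con.execute(
    "EXPLAIN QUERY PLAN SELECT AVG(duration) FROM log WHERE event = 'finished'"
).fetchall()
print(plan)  # the plan mentions a COVERING INDEX: the table itself is never read
```

Note this speeds up reading the needed values, but the query still scans every matching index entry, so it does not make AVG constant-time.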

+3
source

It depends on what you're doing. If you aren't filtering the data, then beyond keeping a clustered index in order, how else would you expect the database to compute the average of a column?

There are systems that perform online analytical processing (OLAP) and will do things such as storing running totals and averages of the information you want to analyze. It all depends on what you are doing and on your definition of "slow."

If you have a web application, for example, you could recompute the average once per minute and then serve the cached value to users over and over.
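The once-per-minute caching idea above can be sketched as a small time-to-live cache. Everything here is illustrative (the class name, the `run_avg_query` stand-in for the slow SELECT, and the injectable clock are my assumptions), not code from the original answer:

```python
import time

class CachedAverage:
    """Recompute an expensive aggregate at most once per `ttl` seconds."""

    def __init__(self, compute, ttl=60.0, clock=time.monotonic):
        self._compute = compute   # callable that runs the expensive query
        self._ttl = ttl
        self._clock = clock       # injectable for testing
        self._value = None
        self._expires = float("-inf")

    def get(self):
        now = self._clock()
        if now >= self._expires:  # stale (or never computed): refresh
            self._value = self._compute()
            self._expires = now + self._ttl
        return self._value

# Hypothetical usage: run_avg_query stands in for SELECT AVG(duration) ...
calls = []
def run_avg_query():
    calls.append(1)
    return 42.0

cache = CachedAverage(run_avg_query, ttl=60.0)
first, second = cache.get(), cache.get()
print(first, second, len(calls))  # 42.0 42.0 1  (the query ran only once)
```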

+2
source

Speeding up aggregates is usually done by maintaining additional summary tables.

Assuming a table detail(id, dimA, dimB, dimC, value), if you want the performance of AVG (or other aggregate functions) to be almost constant regardless of the number of records, you could introduce a new table

dimAavg(dimA, avgValue)

  • The size of this table depends only on the number of distinct dimA values. (Besides, such a table may make sense in your design anyway, since it can hold the range of values available for dimA in detail, along with other attributes related to that domain value.)
  • This table is only useful if you aggregate by dimA alone; as soon as you need AVG(value) grouped by both dimA and dimB, it becomes useless. So you need to know which attributes you want to analyze quickly. The number of rows required to store aggregates over several attributes is n(dimA) x n(dimB) x n(dimC) x ..., which may or may not grow quickly.
  • Maintaining this table adds cost to every update (including inserts and deletes), but there are further optimizations you can apply...

For example, suppose the system mostly inserts, and only occasionally updates and deletes.

Suppose further that you only want to analyze by dimA and that id is ever-increasing. Then a structure such as

 dimA_agg(dimA, Total, Count, LastID) 

can help without significant impact on the system.

This is because you can have triggers that do not fire on every insert, but only on, say, every 100th insert.

That way you can still get exact aggregates, combining this table with the details table:

SELECT a.dimA,
       (SUM(d.value) + MAX(a.Total)) / (COUNT(d.id) + MAX(a.Count)) AS avgDimA
FROM details d
INNER JOIN dimA_agg a ON a.dimA = d.dimA AND d.id > a.LastID
GROUP BY a.dimA

With the right indexes, the above query fetches one row from dimA_agg and fewer than 100 rows from detail, so it runs in near-constant time (~log_fanout n), and dimA_agg does not need to be updated on every insert (reducing the update penalty).

The value 100 was just an example; you have to find the optimal value yourself (or even make it variable, although in that case triggers alone will not be sufficient).

Handling deletes and updates does have to happen on every such operation, but you can still check whether the id of the deleted or updated record is already included in the statistics, to avoid unnecessary updates (saving some I/O).

Note: this analysis applies to a domain with discrete attributes; when working with time series the situation becomes more complicated, since you need to decide the granularity of the domain at which to keep the summaries.
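Putting the pieces above together, here is a runnable end-to-end sketch. The answer assumes PostgreSQL (where the trigger would be written in PL/pgSQL); I use SQLite only to keep the demo self-contained, and the trigger name and seed data are my assumptions:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE details  (id INTEGER PRIMARY KEY, dimA TEXT, value REAL);
CREATE TABLE dimA_agg (dimA TEXT PRIMARY KEY, Total REAL, Count INTEGER, LastID INTEGER);
INSERT INTO dimA_agg VALUES ('A', 0, 0, 0);  -- seed row for the one dimA used below

-- Fold new rows into the aggregate only on every 100th insert,
-- not on every insert (the batching described above).
CREATE TRIGGER fold_into_agg AFTER INSERT ON details
WHEN NEW.id % 100 = 0
BEGIN
    UPDATE dimA_agg SET
        Total  = Total + (SELECT SUM(value) FROM details
                          WHERE dimA = NEW.dimA AND id > dimA_agg.LastID),
        Count  = Count + (SELECT COUNT(*) FROM details
                          WHERE dimA = NEW.dimA AND id > dimA_agg.LastID),
        LastID = NEW.id
    WHERE dimA = NEW.dimA;
END;
""")
con.executemany("INSERT INTO details VALUES (?, 'A', ?)",
                [(i, float(i)) for i in range(1, 251)])

# Exact average = pre-aggregated part + the tail not yet folded in.
row = con.execute("""
    SELECT a.dimA,
           (IFNULL(SUM(d.value), 0) + a.Total) / (COUNT(d.id) + a.Count) AS avgDimA
    FROM dimA_agg a
    LEFT JOIN details d ON d.dimA = a.dimA AND d.id > a.LastID
    GROUP BY a.dimA
""").fetchone()
print(row)  # ('A', 125.5): matches AVG over all 250 rows, reading only ~50 detail rows
```

I used a LEFT JOIN from dimA_agg here (instead of the INNER JOIN in the answer's query) so the result stays correct even when no rows have arrived since the last fold.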

EDIT

There are also materialized views (in PostgreSQL: CREATE MATERIALIZED VIEW, refreshed with REFRESH MATERIALIZED VIEW).

+2
source

Just guessing, but indexes will not help, since computing an average has to read every record (in any order). Indexes are useful for selecting subsets of rows, but if you need to iterate over all rows with no particular ordering, indexes don't help...

0
source

This may not be what you are looking for, but if your table has a way to order the data (e.g. by date), you can do incremental calculations and save the results.

For example, if your data has a date column, you can calculate the average over the records from the start up to Date1, then store that partial average together with Date1 and the number of records averaged. On the next run, you limit the query to Date1..Date2, add the record counts, and update the last processed date. You then have all the information needed to compute the new average.

In this case, it would be useful to have an index on the date or whatever column(s) you use for the ordering.
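The incremental scheme above can be sketched in a few lines. The state layout and function name are my own illustration; in practice `new_rows` would come from a query restricted to rows newer than the saved date:

```python
# State is (total, count, last_date); each refresh folds in only rows
# newer than last_date, so each run touches a small batch of records.
def refresh(state, new_rows):
    """Fold a batch of (date, duration) rows, assumed newer than
    state['last_date'], into the running aggregate; return the new average."""
    for date, duration in new_rows:
        state["total"] += duration
        state["count"] += 1
        state["last_date"] = max(state["last_date"], date)
    return state["total"] / state["count"]

state = {"total": 0.0, "count": 0, "last_date": ""}
a1 = refresh(state, [("2024-01-01", 10), ("2024-01-02", 20)])
a2 = refresh(state, [("2024-01-03", 30)])
print(a1, a2)  # 15.0 20.0
```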

0
source

Source: https://habr.com/ru/post/1332215/

