Preserve ordering for GROUP BY with a monotonic aggregate function

I am developing a query against a table containing the points of a time series. The table can grow quite large, so I want the query to effectively reduce the output by averaging the points over fixed time intervals. After writing the query, I was surprised at how SQL Server (2008) decided to execute it: the execution plan shows an unnecessary sort operation that becomes expensive as the time series grows. The problem boils down to this simple example:

 CREATE TABLE [dbo].[Example]
 (
     [x] FLOAT NOT NULL,
     [y] FLOAT NOT NULL,
     PRIMARY KEY CLUSTERED ( [x] ASC )
 );

 SELECT FLOOR([x]), AVG([y])
 FROM [dbo].[Example]
 GROUP BY FLOOR([x]);

Here I have (x, y) pairs that are already sorted by x (thanks to the clustered primary key), and I average y for each integer x (truncating with the FLOOR function). I would expect the data to already be in the right order for the aggregate, since FLOOR is a monotonic function. Unfortunately, SQL Server decides that the data needs to be re-sorted, and here is the execution plan:

Example Execution Plan

Shouldn't SQL Server be able to use a stream aggregate over data grouped by a monotonic function of columns that are already sorted appropriately?

Is there a general way to rewrite such queries so that SQL Server sees that the order is preserved?

[Update] I found an article on the subject, Things SQL needs: sargability of monotonic functions, and, as the title implies, this appears to be an optimization that SQL Server does not yet perform (in most cases).

Here are even simpler queries against [dbo].[Example] that illustrate the point:

 SELECT [x], [y] FROM [dbo].[Example] ORDER BY FLOOR([x])  -- sort performed in execution plan
 SELECT [x], [y] FROM [dbo].[Example] ORDER BY 2*[x]       -- NO sort performed in execution plan
 SELECT [x], [y] FROM [dbo].[Example] ORDER BY 2*[x]+1     -- sort performed in execution plan

For an individual addition or multiplication, the query optimizer understands that the data already has the same order (and this is also evident when you group by such expressions). So the concept of monotonic functions is understood by the optimizer; it is just not applied in general.
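The same asymmetry shows up with GROUP BY. A sketch against the same [dbo].[Example] table (the actual plans of course depend on the SQL Server version and on statistics):

```sql
-- Stream aggregate with no sort: the optimizer knows 2*[x]
-- preserves the order of the clustered index on [x].
SELECT 2*[x], AVG([y])
FROM [dbo].[Example]
GROUP BY 2*[x];

-- Sort before the aggregate: FLOOR([x]) is not recognized
-- as order-preserving, even though it is monotonic.
SELECT FLOOR([x]), AVG([y])
FROM [dbo].[Example]
GROUP BY FLOOR([x]);
```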

I am currently testing the computed column / index solution, but it looks like it will significantly increase the amount of stored data, since I will need several indexes to cover the range of possible intervals.

3 answers

Some notes:

  • The plan you see when the table is empty and the plan when the table has X rows can be completely different plans.
  • I don't think it is right to have the primary key on the X field. Can there be two points with the same X value?

I think you will have better query performance if you do something like this:

 create table Point
 (
     PointId int identity(1, 1) constraint PK_Example_Id primary key,
     X float not null,
     Y float not null,
     FloorX as floor(X) persisted
 )

 create index IX_Point_FloorX_Y on Point(FloorX, Y)

Add some rows:

 declare @RowCount int = 10000
 while (@RowCount > 0)
 begin
     -- column list is required because of the identity and computed columns
     insert Point (X, Y)
     values (cast(crypt_gen_random(2) as int), cast(crypt_gen_random(2) as int))
     set @RowCount -= 1
 end

Query:

 select floor(X), avg(Y) from Point group by floor(X) 

or

 select FloorX, avg(Y) from Point group by FloorX 

Both will have the same plan.

Plan: Sort


Another option is to create an indexed view. In this case you will have to query the view directly, unless you have Enterprise Edition, which can use the indexed view's index even when you query the table directly.
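A sketch of such an indexed view, assuming the Point table with the persisted FloorX column from above. Note that indexed views cannot contain AVG directly, so SUM and COUNT_BIG are stored and the average is computed at query time:

```sql
CREATE VIEW dbo.PointAvg
WITH SCHEMABINDING
AS
SELECT FloorX,
       SUM(Y)       AS SumY,   -- numerator of the average
       COUNT_BIG(*) AS Cnt     -- required in an indexed view with GROUP BY
FROM dbo.Point
GROUP BY FloorX;
GO

CREATE UNIQUE CLUSTERED INDEX IX_PointAvg ON dbo.PointAvg (FloorX);
GO

-- On editions other than Enterprise, query the view with NOEXPAND:
SELECT FloorX, SumY / Cnt AS AvgY
FROM dbo.PointAvg WITH (NOEXPAND);
```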

[Edit] Just realized that I did not answer your question explicitly. You asked why SQL Server sorts if X is the clustered primary key. SQL Server does not sort by X; it sorts by floor(x). In other words, even if X is already sorted, f(x) will not necessarily have the same order, right?


SQL Server almost always ignores indexes when a function is applied to the indexed columns. There are good reasons for this:

  • The query optimizer (QO) uses data-distribution statistics. A function changes the distribution: do you expect statistics to be generated for each query?
  • A function (as in this case, in fact) can invalidate the uniqueness of an index, and the QO may rely on uniqueness in its plan generation.

Some optimizations are hard-coded into the QO (for example, COUNT vs EXISTS in an IF), but it does not carry out rigorous mathematical proofs: they would not pay for themselves in query compilation time.

There is also a Microsoft Connect item for some datetime functions (which I actually disagree with, because there are too many function permutations to optimize for: we would end up with inconsistency).

Otherwise, the indexed computed column solution from Alex Aza is what I would do.

Edit:

Read your link in the updated question.

FLOOR changes strict monotonicity into plain monotonicity. That is, x is unique and therefore strictly monotonic; FLOOR(x) is merely monotonic.

If you have any WHERE clauses, statistics become important, and, as you said, your examples are simplified.

And for the x*2+1 example you posted: at what point do you think SQL Server should stop evaluating expressions? It is a cost-based optimizer, after all.

I think it is fair that SQL Server behaves like this: the everyday COUNT-vs-EXISTS optimization example is much more useful.


This is a very good question. In such cases we can use another table and CROSS APPLY, as in the following example. It uses a Numbers table that stores all the integers between Min(x)/YourStepInMinutes and Max(x)/YourStepInMinutes, plus a couple more numbers just beyond the minimum and the maximum. This query executes as nested loops and does not require a sort:

 SELECT n.n, AVG(p.y)
 FROM dbo.Numbers AS n
 CROSS APPLY
 (
     SELECT y
     FROM dbo.Points AS p
     WHERE p.x <  n.n * YourStepInMinutes
       AND p.x >= (n.n - 1) * YourStepInMinutes
 ) AS p
 GROUP BY n.n;

Edit: although this solution requires a join, which is not free, I would not assume up front that it always runs slowly. Sorting large amounts of data can suddenly become very slow: you sort 10% more rows, and your sort can become 10 times slower. This approach, on the other hand, scales better because it does not require one big sort.

Also, since we do not need a persisted computed column, we can use this query right away for intervals of any size, such as 17 minutes.
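A minimal sketch of the Numbers helper table this answer assumes (any standard numbers/tally-table technique works; the range just has to cover Min(x)/YourStepInMinutes through Max(x)/YourStepInMinutes, plus a couple of values on each side):

```sql
CREATE TABLE dbo.Numbers ( n INT NOT NULL PRIMARY KEY );

-- Cross-joining a system catalog view with itself yields
-- easily enough rows to number; adjust TOP to the range you need.
INSERT dbo.Numbers (n)
SELECT TOP (100000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a
CROSS JOIN sys.all_objects AS b;
```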


Source: https://habr.com/ru/post/890308/
