This is a different approach.
This query will suffer the same performance problems as other queries that return the correct result, because its execution plan requires a SORT operation on EVERY row in the stats table. Since there is no predicate (restriction) on the time column, EVERY row in the stats table is considered. For a REALLY large stats table, this will exhaust all available temporary space before the query dies a horrible death. (More performance notes below.)
SELECT r.*
     , IFNULL(s.avg_votes,0)
  FROM servers r
  LEFT
  JOIN ( SELECT t.server
              , AVG(t.votes) AS avg_votes
           FROM ( SELECT CASE WHEN u.server = @last_server
                           THEN @i := @i + 1
                           ELSE @i := 1
                         END AS i
                       , @last_server := u.server AS `server`
                       , u.votes AS votes
                    FROM (SELECT @i := 0, @last_server := NULL) i
                    JOIN ( SELECT v.server, v.votes
                             FROM stats v
                            ORDER BY v.server DESC, v.time DESC
                         ) u
                ) t
          WHERE t.i <= 24
          GROUP BY t.server
       ) s
    ON s.server = r.id
This query sorts the stats table by server and by descending time. (Inline view aliased as u.)
With the sorted result set, we assign row numbers 1, 2, 3, etc. to the rows for each server. (Inline view aliased as t.)
With that result set, we filter out any rows with a row number > 24 and calculate the average of the votes column over the "latest" 24 rows for each server. (Inline view aliased as s.)
As a final step, we join that to the servers table to return the requested result set.
Note:
The execution plan for this query will be COSTLY for a large number of rows in the stats table.
To improve performance, there are several approaches we could take.
The simplest is to include in the query a predicate that EXCLUDES a significant number of rows from the stats table (for example, rows with time values older than 2 days, or older than 2 weeks). That would significantly reduce the number of rows that need to be sorted to determine the "latest" 24 rows.
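As a rough sketch of that idea, the innermost view (the one aliased as u) could carry the restriction. This assumes the time column is a DATETIME or TIMESTAMP; the 2-day cutoff is only an example value and has to be wide enough that every server still has its latest 24 rows inside the window.

```sql
-- Sketch only: the inline view u from the query above, limited to recent rows.
-- The 2-day window is an assumed cutoff, not a recommendation.
SELECT v.server, v.votes
  FROM stats v
 WHERE v.time >= NOW() - INTERVAL 2 DAY
 ORDER BY v.server DESC, v.time DESC
```

If some servers report less often than the window allows for, their average would be computed over fewer than 24 rows, so the cutoff is a trade-off between speed and completeness.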
In addition, with an index on stats(server,time), it is also possible that MySQL could do a relatively efficient "reverse scan" of the index, avoiding the sort operation.
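For illustration, the index could be created and checked along these lines. I am assuming here that stats_ix1, the name used in the index hint later in this answer, refers to this (server,time) index.

```sql
-- Assumed name: stats_ix1 (the name that appears in the FORCE INDEX hint below)
CREATE INDEX stats_ix1 ON stats (server, time);

-- Check the plan for the sorted inline view.
-- "Using filesort" in the Extra column means the sort was NOT avoided.
EXPLAIN
SELECT v.server, v.votes
  FROM stats v
 ORDER BY v.server DESC, v.time DESC;
```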
We could also consider an index on the stats table on (server,"reverse_time"). Since MySQL does not yet support descending indexes, the implementation would really be a regular (ascending) index on a derived rtime value: a "reverse time" expression that ascends as time descends, for example -1*UNIX_TIMESTAMP(my_timestamp) or -1*TIMESTAMPDIFF(SECOND,'1970-01-01',my_datetime).
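A minimal sketch of the derived-column variant follows. The column name rtime comes from the text above; the index name stats_ix2 is made up for the example, and I am assuming time is a DATETIME or TIMESTAMP column.

```sql
-- Add a "reverse time" column and backfill it from the existing rows.
ALTER TABLE stats ADD COLUMN rtime BIGINT;
UPDATE stats SET rtime = -1 * UNIX_TIMESTAMP(`time`);

-- stats_ix2 is a hypothetical name for the regular (ascending) index.
CREATE INDEX stats_ix2 ON stats (server, rtime);

-- An ascending scan on rtime returns the latest rows first for each server.
SELECT v.server, v.votes
  FROM stats v
 ORDER BY v.server, v.rtime;
```

New rows would also need rtime populated on insert (by the application or by a trigger), otherwise they end up with a NULL rtime and sort in the wrong place.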
Another approach to improving performance is to keep a shadow table containing the latest 24 rows for each server. That would be simplest to implement if we can guarantee that the "latest rows" won't be deleted from the stats table. We could maintain the shadow table with a trigger. Basically, whenever a row is inserted into the stats table, we check whether the time on the new row is later than the earliest time stored for that server in the shadow table. If it is, we replace the earliest row in the shadow table with the new row, taking care to keep no more than 24 rows per server in the shadow table.
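The trigger logic described above might look roughly like this. Everything named here is an assumption for the sake of the example: the shadow table stats_last24, the trigger name stats_ai, the column types, and the premise that (server, time) is unique in stats. It handles inserts only, not updates or deletes.

```sql
-- Hypothetical shadow table; column types are assumptions.
CREATE TABLE stats_last24
( server INT      NOT NULL
, `time` DATETIME NOT NULL
, votes  INT
, PRIMARY KEY (server, `time`)
);

DELIMITER $$
CREATE TRIGGER stats_ai AFTER INSERT ON stats
FOR EACH ROW
BEGIN
  DECLARE n INT;
  DECLARE oldest DATETIME;

  -- how many rows do we hold for this server, and which is the earliest?
  SELECT COUNT(*), MIN(s.time) INTO n, oldest
    FROM stats_last24 s
   WHERE s.server = NEW.server;

  IF n < 24 THEN
    -- still room: just keep the new row
    INSERT INTO stats_last24 (server, `time`, votes)
    VALUES (NEW.server, NEW.time, NEW.votes);
  ELSEIF NEW.time > oldest THEN
    -- full: swap out the earliest row for the new one
    DELETE FROM stats_last24
     WHERE server = NEW.server AND `time` = oldest;
    INSERT INTO stats_last24 (server, `time`, votes)
    VALUES (NEW.server, NEW.time, NEW.votes);
  END IF;
END$$
DELIMITER ;
```

With the shadow table in place, the average becomes a simple GROUP BY over at most 24 rows per server, e.g. SELECT server, AVG(votes) FROM stats_last24 GROUP BY server. (DELIMITER is a mysql command-line client command, not part of the SQL itself.)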
Yet another approach is to write a procedure or function that gets the result. The idea is to loop through the servers and run a separate query against the stats table for each one to get the average votes for the latest 24 rows, then gather all of those results together. (This may really be more of a workaround to avoid a sort on a huge temporary set, just to make it possible to return the result set at all, without necessarily making it fast.)
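The query such a procedure would run for each server could be as simple as the sketch below. The literal 42 is a placeholder for the current server id inside the loop.

```sql
-- Run once per server; 42 stands in for the server id of the current iteration.
SELECT AVG(x.votes) AS avg_votes
  FROM ( SELECT v.votes
           FROM stats v
          WHERE v.server = 42
          ORDER BY v.time DESC
          LIMIT 24
       ) x;
```

With an index on stats(server,time), each of these only needs to look at a handful of rows for the one server, so no large sort should be required.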
The bottom line for the performance of this type of query on a LARGE table is limiting the number of rows considered by the query AND avoiding a sort operation on a large set. That is how we get a query like this to perform.
ADDENDUM
To get a "reverse index scan" operation (to get the rows from stats ordered using the index WITHOUT a filesort operation), I had to specify DESCENDING on both expressions in the ORDER BY clause. The query previously had ORDER BY server ASC, time DESC, and MySQL always wanted to do a filesort, even when I specified the FORCE INDEX FOR ORDER BY (stats_ix1) hint.
If the requirement is to return an "average votes" value for a server only if there are at least 24 associated rows in the stats table, then we can write a more efficient query, even if it is a bit messier. (Most of the messiness in the nested IF() functions is there to deal with NULL values, which are not included in the average. It could be much less messy if we had a guarantee that votes is NOT NULL, or if we excluded any rows where votes is NULL.)
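SELECT r.*
     , IFNULL(s.avg_votes,0)
  FROM servers r
  LEFT
  JOIN ( SELECT t.server
              , t.tot/NULLIF(t.cnt,0) AS avg_votes
           FROM ( SELECT IF(v.server = @last_server, @num := @num + 1, @num := 1) AS num
                       , @cnt := IF(v.server = @last_server,IF(@num <= 24, @cnt := @cnt + IF(v.votes IS NULL,0,1),@cnt := 0),@cnt := IF(v.votes IS NULL,0,1)) AS cnt
                       , @tot := IF(v.server = @last_server,IF(@num <= 24, @tot := @tot + IFNULL(v.votes,0),@tot := 0),@tot := IFNULL(v.votes,0)) AS tot
                       , @last_server := v.server AS SERVER
                       -- The listing is cut off at this point; the lines below are
                       -- reconstructed from the description that follows (assumption).
                       -- , v.time
                       -- , v.votes
                       -- , @tot/NULLIF(@cnt,0) AS avg_sofar
                    FROM (SELECT @last_server := NULL, @num := 0, @cnt := 0, @tot := 0) u
                    JOIN stats v FORCE INDEX FOR ORDER BY (stats_ix1)
                   ORDER BY v.server DESC, v.time DESC
                ) t
          WHERE t.num = 24
       ) s
    ON s.server = r.id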
With a covering index on stats(server,time,votes), EXPLAIN showed that MySQL avoided a filesort operation, so it must have used a "reverse index scan" to return the rows in order. Without the covering index, and with an index on (server,time), MySQL used the index if I included an index hint; with the FORCE INDEX FOR ORDER BY (stats_ix1) hint, MySQL avoided a filesort as well. (But since my table had fewer than 100 rows, I don't think MySQL was putting much emphasis on avoiding a filesort operation.)
The time, votes, and avg_sofar expressions (in the inline view aliased as t) are commented out; they aren't needed, but they are there for debugging.
As the query stands, it requires at least 24 rows in stats for each server in order to return an average. (That may be acceptable.) But I was thinking that, in general, we could return a running total (tot) and a running count (cnt).
(If we replace WHERE t.num = 24 with WHERE t.num <= 24, we can see the running average in action.)
To return an average when there are fewer than 24 rows in stats for a server, it is really a matter of identifying the row (for each server) with the maximum value of num that is <= 24.