MySQL: forcing a query to use indexes with a user variable in the WHERE clause

Context

I have an application that selects a weighted random record from a table, for which computing prefix sums over the weights is an essential part. A simplified table definition is as follows:

 CREATE TABLE entries (
   id INT NOT NULL PRIMARY KEY AUTO_INCREMENT,
   weight DECIMAL(9, 3),
   fenwick DECIMAL(9, 3)
 ) ENGINE=MEMORY;

where `fenwick` stores the values of the Fenwick tree built over the `weight` column.

Let each record's "range" cover the interval from its prefix sum to its prefix sum plus its own weight. The application generates a random number @r between 0 and SUM(weight) and finds the record whose range covers @r, for example:

(Figure: weighted random entry selection)

The Fenwick tree, combined with the MEMORY engine and binary search, should let me find the matching record in O(lg^2(n)) time, instead of the O(n) of a naive query:

 SELECT a.id-1 FROM (
   SELECT *, (@x:=@x+weight) AS counter
   FROM entries
   CROSS JOIN (SELECT @x:=0) a
   HAVING counter>@r
   LIMIT 1
 ) a;
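As a point of reference, the naive scan corresponds to this client-side loop (a Python sketch with made-up sample weights, not code from the application):

```python
import random

def pick_weighted_naive(weights):
    """Linear scan: O(n), the client-side equivalent of the naive query."""
    total = sum(weights.values())
    r = random.uniform(0, total)
    running = 0.0
    for record_id in sorted(weights):
        running += weights[record_id]
        if running > r:          # r falls inside this record's range
            return record_id
    return max(weights)          # guard against floating-point edge cases

weights = {1: 0.480, 2: 0.542, 3: 0.269, 4: 0.721}
print(pick_weighted_naive(weights))  # one of 1..4, biased by weight
```

Every record is visited in the worst case, which is exactly what the Fenwick tree is meant to avoid.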

Investigation

I am trying to condense the prefix-sum operation into a single query (as opposed to the several array accesses you would use in a scripting language) because of the overhead of multiple round trips. In the process, I realized that the traditional summation method, which accesses elements in descending key order, only sums the first element. I began to suspect that MySQL walks the table linearly when variables appear in the WHERE clause. Here's the query:

 SELECT SUM(1) INTO @garbage
 FROM entries
 CROSS JOIN ( SELECT @sum:=0, @n:=@entryid ) a
 WHERE id=@n
   AND @n>0
   AND (@n:=@n-(@n&(-@n)))
   AND (@sum:=@sum+entries.fenwick);
 /*SELECT @sum*/

where @entryid is the identifier of the record whose prefix sum we are calculating. I did manage to create a query that works (together with an lft function that returns the leftmost set bit of an integer):

 SET @n:=lft(@entryid);
 SET @sum:=0;
 SELECT SUM(1) INTO @garbage
 FROM entries
 WHERE id=@n
   AND @n<=@entryid
   AND (@n:=@n+lft(@entryid^@n))
   AND (@sum:=@sum+entries.fenwick);
 /*SELECT @sum*/
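`lft` is not a MySQL built-in; the question assumes a small user-defined function. For reference, the same helper sketched in Python:

```python
def lft(n):
    """Leftmost (most significant) set bit of a non-negative integer.

    Mirrors the lft() helper the queries above assume exists in MySQL."""
    if n <= 0:
        return 0
    return 1 << (n.bit_length() - 1)

print(lft(60))  # 32, since 60 = 0b111100
```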

but this only confirmed my suspicion of a linear scan. So did an EXPLAIN of the query:

 +------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
 | id   | select_type | table   | type | possible_keys | key  | key_len | ref  | rows   | Extra       |
 +------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
 |    1 | SIMPLE      | entries | ALL  | NULL          | NULL | NULL    | NULL | 752544 | Using where |
 +------+-------------+---------+------+---------------+------+---------+------+--------+-------------+
 1 row in set (0.00 sec)

Indices:

 SHOW INDEXES FROM entries;
 +---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
 | Table   | Non_unique | Key_name | Seq_in_index | Column_name | Collation | Cardinality | Sub_part | Packed | Null | Index_type | Comment | Index_comment |
 +---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
 | entries |          0 | PRIMARY  |            1 | id          | NULL      |      752544 | NULL     | NULL   |      | HASH       |         |               |
 +---------+------------+----------+--------------+-------------+-----------+-------------+----------+--------+------+------------+---------+---------------+
 1 row in set (0.00 sec)

Now, I have seen many questions about eliminating variables from the WHERE clause so the optimizer can work with the query. However, I cannot see how this query could do without id=@n. I considered putting the key values of the records I want to sum into a table and using joins, but I expect undesirable effects: either many tables, or still a linear search for @entryid.

Question

Is there a way to force MySQL to use indexes for this query? I am even willing to try another DBMS if it offers this functionality.

+6

4 answers

Disclaimer

Fenwick trees are new to me; I only discovered them by finding this post. The results presented here are based on my understanding and some research, but I am by no means an expert on these trees, so I may have missed something.

Reference materials

An explanation of how a Fenwick tree works:

https://cs.stackexchange.com/a/10541/38148

https://cs.stackexchange.com/a/42816/38148

Using fenwick trees

https://en.wikipedia.org/wiki/Fenwick_tree

https://en.wikipedia.org/wiki/Prefix_sum

Task 1, finding the cumulative weight of a given record

Given the following table

 CREATE TABLE `entries` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `weight` decimal(9,3) DEFAULT NULL,
   `fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
   PRIMARY KEY (`id`)
 ) ENGINE=INNODB;

and the data already populated (see http://sqlfiddle.com/#!9/be1f2/1 provided by concat), how do we read the cumulative weight for a given record @entryid?

The key concept to understand here is that the Fenwick index structure is based on math and bitwise operations on the id values themselves.

Queries should normally use only primary key lookups ( WHERE id = value ).

Any query using sorting ( ORDER BY ) or ranges ( WHERE (value1 < id) AND (id < value2) ) misses the point and does not traverse the tree in the intended order.

For example, with key 60:

 SET @entryid := 60; 

Expand the value 60 in binary

 mysql> SELECT (@entryid & 0x0080) as b8,
     ->        (@entryid & 0x0040) as b7,
     ->        (@entryid & 0x0020) as b6,
     ->        (@entryid & 0x0010) as b5,
     ->        (@entryid & 0x0008) as b4,
     ->        (@entryid & 0x0004) as b3,
     ->        (@entryid & 0x0002) as b2,
     ->        (@entryid & 0x0001) as b1;
 +------+------+------+------+------+------+------+------+
 | b8   | b7   | b6   | b5   | b4   | b3   | b2   | b1   |
 +------+------+------+------+------+------+------+------+
 |    0 |    0 |   32 |   16 |    8 |    4 |    0 |    0 |
 +------+------+------+------+------+------+------+------+
 1 row in set (0.00 sec)

In other words, keeping only the set bits, we have

 32 + 16 + 8 + 4 = 60 

Now remove the low-order bits set one after the other to move through the tree:

 32 + 16 + 8 + 4 = 60
 32 + 16 + 8     = 56
 32 + 16         = 48
 32              = 32

This gives the path (32, 48, 56, 60) for accessing element 60.

Note that converting 60 to (32, 48, 56, 60) requires only bit math on the id value itself: no table or database access is needed, and this computation can be done in the client issuing the query.
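That bit math can be sketched as a short client-side function; its output is exactly the id list that goes into the IN clause:

```python
def fenwick_prefix_path(entry_id):
    """Ids whose fenwick values sum to the prefix sum of entry_id.

    Repeatedly clears the lowest set bit: 60 -> 56 -> 48 -> 32."""
    path = []
    n = entry_id
    while n > 0:
        path.append(n)
        n -= n & (-n)            # drop the lowest set bit
    return sorted(path)

print(fenwick_prefix_path(60))   # [32, 48, 56, 60]
```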

The fenwick sum for element 60 is then:

 mysql> select sum(fenwick) from entries where id in (32, 48, 56, 60);
 +--------------+
 | sum(fenwick) |
 +--------------+
 |       32.434 |
 +--------------+
 1 row in set (0.00 sec)

Check

 mysql> select sum(weight) from entries where id <= @entryid;
 +-------------+
 | sum(weight) |
 +-------------+
 |      32.434 |
 +-------------+
 1 row in set (0.00 sec)

Now compare the efficiency of these queries.

 mysql> explain select sum(fenwick) from entries where id in (32, 48, 56, 60);
 +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
 | id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
 +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
 |  1 | SIMPLE      | entries | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |    4 |   100.00 | Using where |
 +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+

or, presented differently:

 explain format=json select sum(fenwick) from entries where id in (32, 48, 56, 60);
 {
   "query_block": {
     "select_id": 1,
     "cost_info": {
       "query_cost": "5.61"
     },
     "table": {
       "table_name": "entries",
       "access_type": "range",
       "possible_keys": [
         "PRIMARY"
       ],
       "key": "PRIMARY",
       "used_key_parts": [
         "id"
       ],
       "key_length": "4",
       "rows_examined_per_scan": 4,
       "rows_produced_per_join": 4,
       "filtered": "100.00",
       "cost_info": {
         "read_cost": "4.81",
         "eval_cost": "0.80",
         "prefix_cost": "5.61",
         "data_read_per_join": "64"
       },
       "used_columns": [
         "id",
         "fenwick"
       ],
       "attached_condition": "(`test`.`entries`.`id` in (32,48,56,60))"
     }
   }
 }

So the optimizer fetched 4 rows using the primary key (there are 4 values in the IN clause).

Without using the fenwick structure, we have:

 mysql> explain select sum(weight) from entries where id <= @entryid;
 +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
 | id | select_type | table   | partitions | type  | possible_keys | key     | key_len | ref  | rows | filtered | Extra       |
 +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+
 |  1 | SIMPLE      | entries | NULL       | range | PRIMARY       | PRIMARY | 4       | NULL |   60 |   100.00 | Using where |
 +----+-------------+---------+------------+-------+---------------+---------+---------+------+------+----------+-------------+

Or, presented differently:

 explain format=json select sum(weight) from entries where id <= @entryid;
 {
   "query_block": {
     "select_id": 1,
     "cost_info": {
       "query_cost": "25.07"
     },
     "table": {
       "table_name": "entries",
       "access_type": "range",
       "possible_keys": [
         "PRIMARY"
       ],
       "key": "PRIMARY",
       "used_key_parts": [
         "id"
       ],
       "key_length": "4",
       "rows_examined_per_scan": 60,
       "rows_produced_per_join": 60,
       "filtered": "100.00",
       "cost_info": {
         "read_cost": "13.07",
         "eval_cost": "12.00",
         "prefix_cost": "25.07",
         "data_read_per_join": "960"
       },
       "used_columns": [
         "id",
         "weight"
       ],
       "attached_condition": "(`test`.`entries`.`id` <= (@`entryid`))"
     }
   }
 }

Here, the optimizer scanned the index, reading 60 rows.

With id = 60, the fenwick advantage is 4 fetches versus 60.

Now consider how this scales, for example with values up to 64K.

With fenwick, a 16-bit value has at most 16 set bits, so at most 16 records need to be fetched.

Without fenwick, a scan can read up to 64K records (and will read 32K on average).

Task 2, finding a record for a given weight

The OP's task was to find a record for a given weight.

for instance

 SET @search_weight := 35.123; 

To illustrate the algorithm, this post walks through each lookup (sorry if this is too verbose).

 SET @found_id := 0; 

First, find the highest id:

 SET @max_id := (select id from entries order by id desc limit 1); 

In the test data, max_id is 156.

Since 128 <= max_id < 256, the most significant bit to start the search with is 128.

 mysql> set @search_id := @found_id + 128;
 mysql> select id, fenwick, @search_weight,
     ->        if (fenwick <= @search_weight, "keep", "discard") as action
     ->        from entries where id = @search_id;
 +-----+---------+----------------+---------+
 | id  | fenwick | @search_weight | action  |
 +-----+---------+----------------+---------+
 | 128 |  66.540 |         35.123 | discard |
 +-----+---------+----------------+---------+

The weight 66.540 is greater than our search value, so 128 is discarded; move on to the next bit.

 mysql> set @search_id := @found_id + 64;
 mysql> select id, fenwick, @search_weight,
     ->        if (fenwick <= @search_weight, "keep", "discard") as action
     ->        from entries where id = @search_id;
 +----+---------+----------------+--------+
 | id | fenwick | @search_weight | action |
 +----+---------+----------------+--------+
 | 64 |  33.950 |         35.123 | keep   |
 +----+---------+----------------+--------+

Here we keep this bit (64) and update the remaining search weight:

 set @found_id := @search_id, @search_weight := @search_weight - 33.950; 

Then go to the following bits:

 mysql> set @search_id := @found_id + 32;
 mysql> select id, fenwick, @search_weight,
     ->        if (fenwick <= @search_weight, "keep", "discard") as action
     ->        from entries where id = @search_id;
 +----+---------+----------------+---------+
 | id | fenwick | @search_weight | action  |
 +----+---------+----------------+---------+
 | 96 |  16.260 |          1.173 | discard |
 +----+---------+----------------+---------+

 mysql> set @search_id := @found_id + 16;
 mysql> select id, fenwick, @search_weight,
     ->        if (fenwick <= @search_weight, "keep", "discard") as action
     ->        from entries where id = @search_id;
 +----+---------+----------------+---------+
 | id | fenwick | @search_weight | action  |
 +----+---------+----------------+---------+
 | 80 |   7.394 |          1.173 | discard |
 +----+---------+----------------+---------+

 mysql> set @search_id := @found_id + 8;
 mysql> select id, fenwick, @search_weight,
     ->        if (fenwick <= @search_weight, "keep", "discard") as action
     ->        from entries where id = @search_id;
 +----+---------+----------------+---------+
 | id | fenwick | @search_weight | action  |
 +----+---------+----------------+---------+
 | 72 |   3.995 |          1.173 | discard |
 +----+---------+----------------+---------+

 mysql> set @search_id := @found_id + 4;
 mysql> select id, fenwick, @search_weight,
     ->        if (fenwick <= @search_weight, "keep", "discard") as action
     ->        from entries where id = @search_id;
 +----+---------+----------------+---------+
 | id | fenwick | @search_weight | action  |
 +----+---------+----------------+---------+
 | 68 |   1.915 |          1.173 | discard |
 +----+---------+----------------+---------+

 mysql> set @search_id := @found_id + 2;
 mysql> select id, fenwick, @search_weight,
     ->        if (fenwick <= @search_weight, "keep", "discard") as action
     ->        from entries where id = @search_id;
 +----+---------+----------------+--------+
 | id | fenwick | @search_weight | action |
 +----+---------+----------------+--------+
 | 66 |   1.146 |          1.173 | keep   |
 +----+---------+----------------+--------+

Here we found another bit:

 set @found_id := @search_id, @search_weight := @search_weight - 1.146;

 mysql> set @search_id := @found_id + 1;
 mysql> select id, fenwick, @search_weight,
     ->        if (fenwick <= @search_weight, "keep", "discard") as action
     ->        from entries where id = @search_id;
 +----+---------+----------------+--------+
 | id | fenwick | @search_weight | action |
 +----+---------+----------------+--------+
 | 67 |   0.010 |          0.027 | keep   |
 +----+---------+----------------+--------+

And one more

 set @found_id := @search_id, @search_weight := @search_weight - 0.010; 

The end result of the search:

 mysql> select @found_id, @search_weight;
 +-----------+----------------+
 | @found_id | @search_weight |
 +-----------+----------------+
 |        67 |          0.017 |
 +-----------+----------------+

Check

 mysql> select sum(weight) from entries where id <= 67;
 +-------------+
 | sum(weight) |
 +-------------+
 |      35.106 |
 +-------------+

 mysql> select sum(weight) from entries where id <= 68;
 +-------------+
 | sum(weight) |
 +-------------+
 |      35.865 |
 +-------------+

And indeed

 35.106 (fenwick[67]) <= 35.123 (search) <= 35.865 (fenwick[68]) 

The search resolves one bit at a time, and each lookup result decides the next id to probe.

The queries shown here are for illustration. In a real application, the code should simply be a loop around:

 SELECT fenwick from entries where id = ?; 

with application code (or a stored procedure) implementing the logic for @found_id, @search_id and @search_weight.
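A sketch of that loop in Python, assuming the per-probe SELECT is replaced by a dict lookup; the variable names mirror @found_id, @search_id and @search_weight:

```python
def fenwick_find(fenwick, max_id, search_weight):
    """One probe per bit, high to low, mirroring the session above.

    fenwick maps id -> fenwick value; in SQL each probe is
    SELECT fenwick FROM entries WHERE id = ?.
    Returns (found_id, remaining_weight)."""
    bit = 1
    while bit * 2 <= max_id:     # highest power of two <= max_id (128 for 156)
        bit *= 2
    found_id = 0
    while bit > 0:
        probe = found_id + bit
        bit //= 2
        if probe > max_id:
            continue
        value = fenwick[probe]   # the per-probe SELECT
        if value <= search_weight:
            found_id = probe     # keep this bit
            search_weight -= value
    return found_id, search_weight
```

Run against the fenwick values probed in the walkthrough, it reproduces the same result: found_id 67 with 0.017 remaining.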

General comments

  • Fenwick trees are used to compute prefix sums. They only make sense if the problem being solved involves prefixes in the first place. Wikipedia has several pointers to applications. The OP did not describe what the fenwick tree is used for here.
  • Fenwick trees are a trade-off between lookup complexity and update complexity, so they only make sense for data that is not static. For static data, precomputing the full prefix sums once is more efficient.
  • These tests used an INNODB table, for which the primary key index is sorted, so finding max_id is a trivial O(1) operation. If max_id is already available elsewhere, I think even a MEMORY table with a HASH index on id would work, since only primary key lookups are used.
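To make the update cost in the second point concrete: changing one record's weight touches only O(lg n) fenwick rows, found by adding (rather than clearing) the lowest set bit. A sketch, not part of the original answer:

```python
def fenwick_update_path(entry_id, max_id):
    """Ids whose fenwick values must change when entry_id's weight changes.

    Adds the lowest set bit at each step: 60 -> 64 -> 128 (for max_id=156)."""
    path = []
    n = entry_id
    while n <= max_id:
        path.append(n)
        n += n & (-n)            # add the lowest set bit
    return path

print(fenwick_update_path(60, 156))  # [60, 64, 128]
```

In SQL this would be one UPDATE per id in the list, all primary key lookups.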

PS

sqlfiddle is down today, so here is the raw data (originally provided by concat) so that readers can rerun the tests.

 INSERT INTO `entries` VALUES (1,0.480,0.480),(2,0.542,1.022),(3,0.269,0.269),(4,0.721,2.012),(5,0.798,0.798),(6,0.825,1.623),(7,0.731,0.731),(8,0.181,4.547),(9,0.711,0.711),(10,0.013,0.724),(11,0.930,0.930),(12,0.613,2.267),(13,0.276,0.276),(14,0.539,0.815),(15,0.867,0.867),(16,0.718,9.214),(17,0.991,0.991),(18,0.801,1.792),(19,0.033,0.033),(20,0.759,2.584),(21,0.698,0.698),(22,0.212,0.910),(23,0.965,0.965),(24,0.189,4.648),(25,0.049,0.049),(26,0.678,0.727),(27,0.245,0.245),(28,0.190,1.162),(29,0.214,0.214),(30,0.502,0.716),(31,0.868,0.868),(32,0.834,17.442),(33,0.566,0.566),(34,0.327,0.893),(35,0.939,0.939),(36,0.713,2.545),(37,0.747,0.747),(38,0.595,1.342),(39,0.733,0.733),(40,0.884,5.504),(41,0.218,0.218),(42,0.437,0.655),(43,0.532,0.532),(44,0.350,1.537),(45,0.154,0.154),(46,0.721,0.875),(47,0.140,0.140),(48,0.538,8.594),(49,0.271,0.271),(50,0.739,1.010),(51,0.884,0.884),(52,0.203,2.097),(53,0.361,0.361),(54,0.197,0.558),(55,0.903,0.903),(56,0.923,4.481),(57,0.906,0.906),(58,0.761,1.667),(59,0.089,0.089),(60,0.161,1.917),(61,0.537,0.537),(62,0.201,0.738),(63,0.397,0.397),(64,0.381,33.950),(65,0.715,0.715),(66,0.431,1.146),(67,0.010,0.010),(68,0.759,1.915),(69,0.763,0.763),(70,0.537,1.300),(71,0.399,0.399),(72,0.381,3.995),(73,0.709,0.709),(74,0.401,1.110),(75,0.880,0.880),(76,0.198,2.188),(77,0.348,0.348),(78,0.148,0.496),(79,0.693,0.693),(80,0.022,7.394),(81,0.031,0.031),(82,0.089,0.120),(83,0.353,0.353),(84,0.498,0.971),(85,0.428,0.428),(86,0.650,1.078),(87,0.963,0.963),(88,0.866,3.878),(89,0.442,0.442),(90,0.610,1.052),(91,0.725,0.725),(92,0.797,2.574),(93,0.808,0.808),(94,0.648,1.456),(95,0.817,0.817),(96,0.141,16.260),(97,0.256,0.256),(98,0.855,1.111),(99,0.508,0.508),(100,0.976,2.595),(101,0.353,0.353),(102,0.840,1.193),(103,0.139,0.139),(104,0.178,4.105),(105,0.469,0.469),(106,0.814,1.283),(107,0.664,0.664),(108,0.876,2.823),(109,0.390,0.390),(110,0.323,0.713),(111,0.442,0.442),(112,0.241,8.324),(113,0.881,0.881),(114,0.681,1.562),(115,0.760,0.760),(116,
0.760,3.082),(117,0.518,0.518),(118,0.313,0.831),(119,0.008,0.008),(120,0.103,4.024),(121,0.488,0.488),(122,0.135,0.623),(123,0.207,0.207),(124,0.633,1.463),(125,0.542,0.542),(126,0.812,1.354),(127,0.433,0.433),(128,0.732,66.540),(129,0.358,0.358),(130,0.594,0.952),(131,0.897,0.897),(132,0.701,2.550),(133,0.815,0.815),(134,0.973,1.788),(135,0.419,0.419),(136,0.175,4.932),(137,0.620,0.620),(138,0.573,1.193),(139,0.004,0.004),(140,0.304,1.501),(141,0.508,0.508),(142,0.629,1.137),(143,0.618,0.618),(144,0.206,8.394),(145,0.175,0.175),(146,0.255,0.430),(147,0.750,0.750),(148,0.987,2.167),(149,0.683,0.683),(150,0.453,1.136),(151,0.219,0.219),(152,0.734,4.256),(153,0.016,0.016),(154,0.874,0.891),(155,0.325,0.325),(156,0.002,1.217); 

PS 2

Now with a full sqlfiddle:

http://sqlfiddle.com/#!9/d2c82/1

+3

(I'm using the answer field because it allows formatting.)

The ENGINE is the main problem here, as Rick pointed out. You can influence the type of index created with "USING BTREE" in the index definition (BTREE versus HASH actually matters in this case: when you iterate over a range, BTREE is optimal; when you fetch by a single value, HASH is optimal; your query exhibits both behaviors).

When you switch to INNODB, caching will probably make the query as fast as with the MEMORY table, and you gain the index advantage. To guarantee BTREE indexing, I would create the schema as follows:

 CREATE TABLE `entries` (
   `id` int(11) NOT NULL AUTO_INCREMENT,
   `weight` decimal(9,3) DEFAULT NULL,
   `fenwick` decimal(9,3) NOT NULL DEFAULT '0.000',
   PRIMARY KEY (`id`)
 ) ENGINE=INNODB DEFAULT CHARSET=latin1;

 CREATE UNIQUE INDEX idx_nn_1 ON entries (id, fenwick) USING BTREE;

Here the index idx_nn_1 is used for the main calculation (and only the index: the table itself is never touched, since all the needed data is in the index). However, a sample of 100 records is too small to give a definitive performance answer. The time it takes to build the index, compared to accessing the data through the table alone, may be such that you get no performance gain at all. So the final answer will come from your own testing.

Other database engines (SQL Server, Oracle, Postgres) will show similar behavior, so switching to any of them will not make much difference, except perhaps for somewhat better processing in general (there is no way to predict this).

SQL Server may be slightly faster at building the index, since it can use a unique index on id alone and INCLUDE the fenwick value, so you do not actually need to index that column.

Oracle can actually force index use, but this is not recommended: in Oracle, assuming ordered data in the table, reading the table is faster than reading the index and then the table. Again, in this scenario you can simply add the (id, fenwick) index and never access the table. Given the index build time, Oracle would still have to read the full table, and in that time (or less, depending on how many records are needed to reach the exit condition) it could also have performed your calculation directly.

0

Is the Fenwick tree static enough to precompute some things? If so, I can offer a practical O(1) solution:

  • build a table with two columns (n, cumulative_sum)
  • prepopulate the table: (1, 0.851), (2, 2.574), (3, 2.916), (4, 3.817), ...
  • create an index on the cumulative_sum column

Then to search:

 SELECT n FROM tbl
 WHERE cumulative_sum > 3.325
 ORDER BY cumulative_sum
 LIMIT 1;
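Client-side, the same lookup is a binary search over the running totals; a Python sketch using the sample values above, where `bisect` plays the role of the BTREE index:

```python
import bisect

# Precomputed running totals, as in the (n, cumulative_sum) table above.
cumulative = [0.851, 2.574, 2.916, 3.817]

def find_n(value):
    """Smallest n whose cumulative_sum strictly exceeds value (n is 1-based),
    matching WHERE cumulative_sum > value ... LIMIT 1."""
    return bisect.bisect_right(cumulative, value) + 1

print(find_n(3.325))  # 4, since the first cumulative_sum > 3.325 is 3.817
```

`bisect_right` gives the strict ">" semantics of the SQL: a value equal to some cumulative_sum still maps to the next record.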

If the @variables cause problems, a stored procedure can build the SQL via CONCAT , PREPARE and EXECUTE .

Addenda

Given that this is a periodic full replacement, compute the cumulative totals while rebuilding the table. My SELECT touches only one row, so it is O(1) (ignoring the BTree lookup).

For a "full replacement" I recommend:

 CREATE TABLE new LIKE real;
 load the data into `new`                -- this is the slowest step
 RENAME TABLE real TO old, new TO real;  -- atomic and fast, so no "downtime"
 DROP TABLE old;
0

To add to Marc's answer: for stored procedures or functions, where the list of indexes to sum cannot be passed directly through the function arguments, we can generate the indexes inside the query and JOIN them against the summation query:

 SELECT SUM(fenwick)
 FROM entries
 CROSS JOIN (SELECT @n:=60) a
 INNER JOIN (
   SELECT @n AS id, 0
   UNION
   SELECT IF(@n&@bit<>0, @n:=@n-(@n&@bit), NULL) AS id, (@bit:=@bit<<1)
   FROM entries
   CROSS JOIN (SELECT @bit:=1) a
   LIMIT 32
 ) dt ON dt.id=entries.id;

I expect the performance to be the same, and the client no longer has to generate the index list.

0

Source: https://habr.com/ru/post/984751/

