MySQL is looking for a good index

I have this table (simplified version)

create table completions ( id int(11) not null auto_increment, completed_at datetime default null, is_mongo_synced tinyint(1) default '0', primary key (id), key index_completions_on_completed_at_and_is_mongo_synced_and_id (completed_at,is_mongo_synced,id), ) engine=innodb auto_increment=4785424 default charset=utf8 collate=utf8_unicode_ci; 

The size:

 select count(*) from completions; -- => 4817574 

Now I am trying to execute this query:

 select completions.* from completions where (completed_at is not null) and completions.is_mongo_synced = 0 order by completions.id asc limit 10; 

And it takes 9mins .

I see that no index is used, explain extend returns this:

 id: 1 select_type: SIMPLE table: completions type: index possible_keys: index_completions_on_completed_at_and_is_mongo_synced_and_id key: PRIMARY key_len: 4 ref: NULL rows: 20 filtered: 11616415.00 Extra: Using where 

If I force the index:

 select completions.* from completions force index(index_completions_on_completed_at_and_is_mongo_synced_and_id) where (completed_at is not null) and completions.is_mongo_synced = 0 order by completions.id asc limit 10; 

It takes 1.22 s , which is much better. explain extend returns:

 id: 1 select_type: SIMPLE table: completions type: range possible_keys: index_completions_on_completed_at_and_is_mongo_synced_and_id key: index_completions_on_completed_at_and_is_mongo_synced_and_id key_len: 6 ref: null rows: 2323334 filtered: 100 Extra: Using index condition; Using filesort 

Now, if I judge the completions.id request as follows:

 select completions.* from completions force index(index_completions_on_completed_at_and_is_mongo_synced_and_id) where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 2000000 order by completions.id asc limit 10; 

It takes 1.31 s , still good. explain extend returns:

 id: 1 select_type: SIMPLE table: completions type: range possible_keys: index_completions_on_completed_at_and_is_mongo_synced_and_id key: index_completions_on_completed_at_and_is_mongo_synced_and_id key_len: 6 ref: null rows: 2323407 filtered: 100 Extra: Using index condition; Using filesort 

The fact is that if for the last query I do not force the index:

 select completions.* from completions where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 2000000 order by completions.id asc limit 10; 

It takes 85 ms , check that it is ms , not s . explain extend returns:

 id: 1 select_type: SIMPLE table: completions type: range possible_keys: PRIMARYindex_completions_on_completed_at_and_is_mongo_synced_and_id key: PRIMARY key_len: 4 ref: null rows: 2323451 filtered: 100 Extra: Using where 

Not only does this annoy me, but the fact that the performance of the last request is highly dependent on small changes in the number of filters:

 select completions.* from completions where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 1600000 order by completions.id asc limit 10; 

13s required

Things I don't understand:

  • Why is lower query A faster than query B when query B involves using a more accurate index: c

Request A:

 select completions.* from completions where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 2000000 order by completions.id asc limit 10; 

85ms

Request B:

 select completions.* from completions force index(index_completions_on_completed_at_and_is_mongo_synced_and_id) where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 2000000 order by completions.id asc limit 10; 

1.31s

2. Why is there such a difference in the following queries:

Request A:

 select completions.* from completions where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 2000000 order by completions.id asc limit 10; 

85ms

Request B:

 select completions.* from completions where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 1600000 order by completions.id asc limit 10; 

13s

3. Why MySQL does not automatically use the index for the following query:

Pointer:

 key index_completions_on_completed_at_and_is_mongo_synced_and_id (completed_at,is_mongo_synced,id), 

Query:

 select completions.* from completions force index(index_completions_on_completed_at_and_is_mongo_synced_and_id) where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 2000000 order by completions.id asc limit 10; 

Update

Some more data requested in the comments

Number of rows based on is_mongo_synced values
  select completions.is_mongo_synced, count(*) from completions group by completions.is_mongo_synced; 

Result:

 [ { "is_mongo_synced":0, "count(*)":2731921 }, { "is_mongo_synced":1, "count(*)":2087869 } ] 
Requests without order by
 select completions.* from completions where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 2000000 limit 10; 

544ms

 select completions.* from completions force index(index_completions_on_completed_at_and_is_mongo_synced_and_id) where (completed_at is not null) and completions.is_mongo_synced = 0 and completions.id > 2000000 limit 10; 

314ms

But, in any case, I need an order, because I look at a batch batch of packages.

+5
source share
3 answers

Your questions are quite complicated. But for your first request:

 select completions.* from completions where completed_at is not null and completions.is_mongo_synced = 0 order by completions.id asc limit 10; 

The best index at (is_mongo_synced, completed_at) . There may be other ways to write the query, but the columns you are boosting do not have the columns in the optimal order.

The difference in performance in the second query is probably due to the fact that the data is actually sorted. A few hundred thousand lines can affect the sorting time. Dependence on the id value is probably the way to use the index. If you change the index to (is_mongo_synced, id, completed_at) , then using the index will be more likely.

MySQL has good documentation on composite indexes. You can view it here .

After adding the proposed filter

After adding the index:

 KEY `index_completions_on_is_mongo_synced_and_id_and_completed_at` (`is_mongo_synced`,`id`,`completed_at`) USING BTREE, 

Again run a long request

 select completions.* from completions where (completed_at is not null) and completions.is_mongo_synced = 0 order by completions.id asc limit 10; 

It takes 156ms , which is very good.

Checking explain extended , we see that MySQL uses the correct index:

 id: 1 select_type: SIMPLE table: completions type: ref possible_keys: index_completions_on_completed_at_and_is_mongo_synced_and_id,index_completions_on_is_mongo_synced_and_id_and_completed_at key: index_completions_on_is_mongo_synced_and_id_and_completed_at key_len: 2 ref: const rows: 1626322 filtered: 100 Extra: Using index condition; Using where 
+4
source

Are you trying to force the index

 (completed_at, is_mongo_synced, id) 

This is a b-tree, and it must first examine all the various completed_at values โ€‹โ€‹that are not NULL , and then the correct mongo_synced at for each of them, they collect all the identifiers and sort them and finally visit the table to retrieve the desired rows.

With the primary key, on the other hand, it (suppose it is a clustering key) simply jumps to the page with completion. gt> 2,000,000 and reads consecutive lines until it collects 10 of them, if not on this page, then the next one is selected.

In the end, both queries probably check the same number of pages in table +, the first should get the whole index and sort it.

If you want to use the index try

 (is_mongo_synced, id, completed_at) 

Consult the clustered index guide.

+3
source

Caution: I assume InnoDB.

To build an optimal index,

  • Collect all the '=' values. This is just is_mongo_synced . This allows you to search in one adjacent place in the index.
  • Add one more thing.

If you add completed_at , it will check all non-NULL entries and collect ids for sorting later. Sorting ( ORDER BY ) costs something and cannot be avoided with INDEX(is_mongo_synced, completed_at, ...) .

If you add id instead, there is now a chance that it can avoid sorting. But it still has to finish the filtering (to avoid the NULL completed_at lines). So INDEX(is_mongo_synced, id, ...) can be good.

If you have both indexes, the optimizer is not suitable for choosing between these two indexes, because it strongly depends on the distribution of data and whether you have a LIMIT . You, who understand the data, may or may not choose which index is better.

I said "...". I meant that you can stop there, or add more columns to the index. Adding more columns falls into the so-called โ€œcoverage indexโ€. If all the columns mentioned in the SELECT exist (anywhere) in the secondary index, then this is "coverage". So? First, let me come back ...

When you look at something in the secondary index, it finds a PRIMARY KEY at the bottom of BTree. He then scans another column (s), drilling a clustered PK BTree. This extra deployment can be expensive. But...

If the index "covers", then there is no need to do this extra BTree reversal.

You accidentally had a covering index, but not in the optimal order. The entire index had to be scanned and then sorted. Each of my indexes avoids scanning the entire index, thus, most likely, will be faster.

By adding an extra column, I have two (competing) coverage indexes:

 KEY mci (is_mongo_synced, completed_at, id) KEY mic (is_mongo_synced, id, completed_at) 

Aside ... Since PK is automatically added to each secondary key, these three-column indexes exist even if I mentioned only the first 2 columns. So, do not be surprised if you try 2 and 3, but do not find the difference.

For clarity, I will leave "mci" and "mic" with three explicit columns.

Reanalyze them ...

'mci' scans the part of the index containing is_mongo_synced=0 AND completed_at IS NOT NULL . These "strings" in the index are sequential, which minimizes disk images. And it gets the identifiers, which are then sorted.

'mic' scans the part of the index containing is_mongo_synced=0 . This is a big part than for "mci". But the identifiers are in order, thereby eliminating sorting. However, it should now scratch the NULL strings when scanning the index.

Bottom line. I would replace your composite index ('cmi') with "mic" and "mci".

If you have other queries, such as those related to a specific completed date (s), you may need an index starting with completed_at .

See also my cookbook by index: mysql.rjweb.org/doc.php/index_cookbook_mysql .

One more thing ... If all the data and / or index blocks you need are in the cache ("buffer pool"), the request will work, perhaps 10 times faster than if you got to disk. See innodb_buffer_pool_size for setting up this cache - usually 70% of the available memory is good. Your 9 minute test smells like the buffer pool was either cold or too small.

+2
source

Source: https://habr.com/ru/post/1237845/


All Articles