Why doesn't this query use a PostgreSQL index-only scan?

I have a table with 16 columns, including a primary key and a column (easyid) that stores the values. I want to select all values in a specific range. The easyid column is indexed.

    create table tb1 (
        id int primary key,
        easyid int,
        .....
    );

    create index i_easyid on tb1 (easyid);

Additional information: PostgreSQL 9.4, autovacuum disabled. The SQL is as follows:

 select "easyid" from "tb1" where "easyid" between 12183318 and 82283318 

In theory, PostgreSQL should use an index-only scan on i_easyid. However, it only uses an index-only scan when the range of "easyid" between A and B is small. When the range is large, i.e. B - A is a fairly large number, PostgreSQL uses a Bitmap Index Scan on i_easyid followed by a Bitmap Heap Scan on tb1.

I may be mistaken in saying that the choice of an index-only scan depends on the size of the range. I tried the same query with different parameters; sometimes it uses an index-only scan, sometimes it does not.

Table tb1 is quite large, about 17 GB; i_easyid is about 600 MB.

Here is the EXPLAIN output. I do not understand why fetching a few thousand rows can take more than 10 seconds.

    sample_pg=# explain analyze select easyid from tb1 where "easyid" between 152183318 and 152283318;
                                                               QUERY PLAN
    ----------------------------------------------------------------------------------------------------------------------------
     Bitmap Heap Scan on tb1  (cost=97.70..17227.71 rows=4416 width=4) (actual time=1.155..14346.311 rows=5004 loops=1)
       Recheck Cond: ((easyid >= 152183318) AND (easyid <= 152283318))
       Heap Blocks: exact=4995
       ->  Bitmap Index Scan on i_easyid  (cost=0.00..96.60 rows=4416 width=0) (actual time=0.586..0.586 rows=5004 loops=1)
             Index Cond: ((easyid >= 152183318) AND (easyid <= 152283318))
     Planning time: 0.080 ms
     Execution time: 14348.037 ms
    (7 rows)

Here is an example where an index-only scan is used:

    sample_pg=# explain analyze verbose select easyid from tb1 where "easyid" between 32280318 and 32283318;
                                                                  QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------------------
     Index Only Scan using i_easyid on public.tb1  (cost=0.44..281.82 rows=69 width=4) (actual time=14.585..160.624 rows=33 loops=1)
       Output: easyid
       Index Cond: ((tb1.easyid >= 32280318) AND (tb1.easyid <= 32283318))
       Heap Fetches: 33
     Planning time: 0.085 ms
     Execution time: 160.654 ms
    (6 rows)
+6
2 answers

Autovacuum is not running.

PostgreSQL index-only scans require some information about which rows are "visible" to current transactions, i.e. not deleted, not old versions of updated rows, and not uncommitted inserts or updates.

This information is stored in a "visibility map".

The visibility map is maintained by VACUUM, usually in the background by autovacuum workers.

If autovacuum is not keeping up with write activity, or if autovacuum is disabled, then index-only scans will probably not be used, because PostgreSQL will see that the visibility map has no data for a large enough fraction of the table.
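
On 9.4 you can get a rough idea of how much of the table the visibility map currently covers from pg_class (a quick check, not an exact measure; relallvisible is only refreshed by VACUUM):

    -- Fraction of tb1's pages marked all-visible; 0 or near 0 means an
    -- index-only scan would have to fetch from the heap for almost every row.
    select relname,
           relpages,
           relallvisible,
           round(100.0 * relallvisible / greatest(relpages, 1), 1) as pct_all_visible
    from pg_class
    where relname = 'tb1';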

Enable autovacuum. Then manually VACUUM the table to bring it up to date immediately.
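
For example, something along these lines (assuming autovacuum was disabled globally in the server configuration; adjust to however it was actually turned off):

    -- Re-enable autovacuum (ALTER SYSTEM is available from 9.4) and reload:
    ALTER SYSTEM SET autovacuum = on;
    SELECT pg_reload_conf();

    -- Rebuild the visibility map and refresh statistics for this table now:
    VACUUM (ANALYZE, VERBOSE) tb1;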

BTW, in addition to visibility map information, VACUUM can also write hint bits, which make recently inserted / updated data faster to SELECT.

Autovacuum also maintains table statistics that are vital for efficient query planning. Disabling it leaves the planner working with increasingly outdated information.

It is also crucial for preventing a problem called transaction ID wraparound, an emergency condition that can cause the whole database to stop working until a time-consuming full-table VACUUM is performed.
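
If you want to see how close your databases are getting to wraparound, a simple check is:

    -- age(datfrozenxid) approaches ~2 billion as wraparound gets near
    select datname, age(datfrozenxid) as xid_age
    from pg_database
    order by xid_age desc;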

Do not turn off autovacuum.

As for why it sometimes uses an index-only scan and sometimes does not, there are a few possibilities:

  • The current random_page_cost setting makes the planner think that random I/O will be slower than it actually is, so it tries harder to avoid it.

  • Table statistics, especially the histogram boundary values, are out of date, so the planner does not realise that the requested values have a good chance of being found quickly by an index-only scan.

  • The visibility map is out of date, so the planner thinks an index-only scan would find too many values that require heap checks, making it slower than other methods, especially if it expects a high proportion of matching rows.

Most of these problems are fixed by simply leaving autovacuum alone. In fact, on tables that receive frequent inserts you should set autovacuum to run much more often than the default, so that it keeps the boundary statistics fresh. (This helps work around planner problems with tables where the most frequently queried data is the most recently inserted, with an incrementing ID or timestamp, meaning the most wanted values never appear in the table's histograms and statistics.)
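
A sketch of what that per-table tuning might look like (the 0.02 scale factors are purely illustrative, not a recommendation for this particular table):

    -- Make autovacuum / autoanalyze trigger after a much smaller fraction
    -- of the table has changed than the global defaults:
    ALTER TABLE tb1 SET (
        autovacuum_vacuum_scale_factor  = 0.02,
        autovacuum_analyze_scale_factor = 0.02
    );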

Turn autovacuum back on, then run a manual VACUUM on the table.

+9

I am not 100% sure, but I suspect PostgreSQL thinks that reading the table will be faster than reading the index, because of random_page_cost. Going through the index is potentially more expensive because it requires fetching essentially random pages.

The data obtained from the table would need sorting, but the cost calculation probably concludes that the total cost of (sequentially reading the table + sorting) is lower than (random index reads).

This can be partially verified by changing random_page_cost, which is worth exploring anyway if you use very fast disks or SSDs.
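
A quick session-level experiment, without touching the server configuration (1.1 is just an illustrative value for fast storage; the default is 4.0):

    SET random_page_cost = 1.1;
    EXPLAIN ANALYZE
    SELECT easyid FROM tb1 WHERE easyid BETWEEN 152183318 AND 152283318;
    RESET random_page_cost;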

+2

Source: https://habr.com/ru/post/984732/
