How does PostgreSQL perform ORDER BY if a b-tree index is built on that field?

Question

I have a bsort table:

 CREATE TABLE bsort(a int, data text);

Here data may be incomplete. In other words, some tuples may have no data value.

Then I build a b-tree index on the table:

 CREATE INDEX ON bsort USING BTREE(a);

Now, if I execute this query:

 SELECT * FROM bsort ORDER BY a;

Does PostgreSQL sort the tuples with O(n log n) complexity, or does it get the order directly from the b-tree index?

+2

sorting indexing sql-order-by postgresql postgresql-performance

Kingston Chan Jul 23 '15 at 1:20

2 answers

You would need to check the execution plan. But Postgres is quite capable of using the index to make the ORDER BY more efficient. It would read the records directly from the index. Because you have only one column, there is no need to access the data pages.
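A minimal sketch of how the plan could be checked, assuming the bsort table and the index on a from the question already exist:

```sql
-- Show the plan the planner picks, without running the query.
EXPLAIN
SELECT * FROM bsort ORDER BY a;
```

If the plan shows an Index Scan node on the index and no Sort node, the order comes straight from the b-tree. A Sort node on top of a Seq Scan means Postgres sorts explicitly, which the planner may well prefer for small tables or when the whole table has to be read anyway.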

0

Gordon Linoff Jul 23 '15 at 1:28

Erwin Brandstetter · Accepted Answer · 2015-07-23 02:30

For a simple query like this, Postgres will use an index scan and retrieve readily sorted tuples from the index in order. Because of its MVCC model, Postgres always had to visit the heap (data pages) additionally to verify that entries are actually visible to the current transaction. Quoting the Postgres Wiki on index-only scans:

PostgreSQL indexes do not contain visibility information. That is, it is not directly possible to ascertain if any given tuple is visible to the current transaction, which is why it has taken so long for index-only scans to be implemented.

That finally happened in version 9.2: index-only scans. The documentation:

If the index stores the original indexed data values (and not some lossy representation of them), it is useful to support index-only scans, in which the index returns the actual data, not just the TID of the heap tuple. This will only work if the visibility map shows that the TID is on an all-visible page; else the heap tuple must be visited anyway to check MVCC visibility.

So whether index-only scans are possible now depends on the visibility map of the table. They are only an option if all involved columns are included in the index. Otherwise, the heap has to be visited (additionally) in any case. The sort step is still not needed.

This is why we sometimes append otherwise useless columns to indexes. Like the data column in your example:

 CREATE INDEX ON bsort USING BTREE(a, data);

This makes the index bigger (how much depends on the size of data) and a bit more expensive to maintain and to use for other purposes that do not allow an index-only scan. So only append the data column if you actually get index-only scans out of it. The order of columns in the index is important.
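As a rough illustration (a sketch, assuming the two-column index on (a, data) shown above has been created), a query that touches only indexed columns and sorts by the leading column is a candidate for an index-only scan:

```sql
-- Both selected columns are stored in the (a, data) index,
-- and the sort key a is its leading column.
EXPLAIN
SELECT a, data FROM bsort ORDER BY a;
```

With the columns in the opposite order, (data, a), the index entries would be sorted by data first and could not deliver rows ordered by a without a separate sort step.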

The benefit of an index-only scan, per the documentation:

If it's known that all tuples on the page are visible, the heap fetch can be skipped. This is most noticeable on large data sets where the visibility map can prevent disk accesses. The visibility map is vastly smaller than the heap, so it can easily be cached even when the heap is very large.

The visibility map is maintained by VACUUM, which happens automatically if you have autovacuum running (the default setting in modern Postgres). Details:

Are regular VACUUM ANALYZE still recommended under 9.1?

But there is some delay between write operations to the table and the next VACUUM run. The gist of it:

Read-only tables stay ready for index-only scans once vacuumed.
Data pages that have been modified lose their "all-visible" flag in the visibility map until the next VACUUM (and until all older transactions have finished), so this depends on the ratio between write operations and VACUUM frequency.
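To get a feel for how much of the table is currently marked all-visible, one option is to vacuum manually and then look at the planner's page counts in pg_class. This is only a sketch: relallvisible is an estimate used by the planner, not an exact live count.

```sql
-- Vacuum manually instead of waiting for autovacuum,
-- and refresh planner statistics at the same time.
VACUUM (ANALYZE) bsort;

-- Compare total pages with pages marked all-visible.
SELECT relpages, relallvisible
FROM   pg_class
WHERE  relname = 'bsort';
```

The closer relallvisible is to relpages, the fewer heap visits an index-only scan should need.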

Partial index-only scans are still possible if some of the involved pages are marked all-visible. But if the heap has to be visited anyway, the plain index scan access method is a bit cheaper. So if too many pages are currently dirty, Postgres will switch to the cheaper index scan altogether. The Postgres Wiki again:

As the number of heap fetches (or "visits") that are projected to be needed by the planner goes up, the planner will eventually conclude that an index-only scan isn't desirable, as it isn't the cheapest possible plan according to its cost model. The value of index-only scans lies wholly in their potential to allow us to elide heap access (if only partially) and minimise I/O.
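The number of heap visits that actually occurred can be observed with EXPLAIN ANALYZE. A hedged sketch, again assuming the (a, data) index from above; note that ANALYZE really executes the query:

```sql
-- Run the query and report actual row counts and timings.
EXPLAIN (ANALYZE)
SELECT a, data FROM bsort ORDER BY a;
```

When the plan contains an Index Only Scan node, its output includes a "Heap Fetches" line. A value near zero means the visibility map let Postgres skip almost all heap access; a large value suggests many pages were modified since the last VACUUM.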
