PostgreSQL ORDER BY: choosing the right index

There is a table T(user, timestamp, ...) with 100 million+ records (PostgreSQL 9.1).

A query of the form

 SELECT * FROM T WHERE user='abcd' ORDER BY timestamp LIMIT 1 

uses the timestamp index instead of the user index when there are ~100,000 records for that user.

Using the timestamp index always gives poor results (20+ seconds), because it ends up scanning all records. Bypassing the timestamp index by rewriting the query to use ORDER BY DATE(timestamp) makes the planner use the user index and returns results in under 100 ms.

  • Total RAM: 64 GB
  • shared_buffers: 16 GB
  • work_mem: 32 MB

Why does PostgreSQL ignore the user index and use the timestamp index instead (even though going through the timestamp index means visiting all records)? Are there any PostgreSQL configuration options that can be changed to force the query to use the user index?

1 answer

Good question. I ran into this problem myself a while ago.

Why is this happening?

You should look at how common the value user='abcd' is in the statistics, like this:

 SELECT attname, null_frac, avg_width, n_distinct, most_common_vals, most_common_freqs, histogram_bounds FROM pg_stats WHERE tablename='T'; 

My guess is that this value is quite common, so you will find it in the most_common_vals output. Pick the matching element from most_common_freqs to get the frequency (fraction) for that value, then multiply it by the total number of rows (available from pg_class.reltuples) to get the number of rows estimated to have the value 'abcd'.
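As a rough sketch of that arithmetic (Python; the numbers are made up for illustration, so substitute the reltuples and frequency values from your own catalogs):

```python
# Hypothetical numbers; read the real ones from pg_class.reltuples and
# the matching most_common_freqs entry in pg_stats for your table.
reltuples = 100_000_000  # planner's total row estimate for T
freq_abcd = 0.001        # frequency of user = 'abcd' in most_common_freqs

# Rows the planner estimates to match user = 'abcd':
estimated_rows = round(reltuples * freq_abcd)
print(estimated_rows)  # 100000
```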

The planner assumes that values are distributed uniformly across the table. In reality this is often not the case. In addition, there are currently no cross-column correlation statistics (although some work is being done in this direction).

So, let's say user='abcd' has a frequency of 0.001 in the corresponding most_common_freqs entry. That means the value occurs once in every 1000 rows (assuming a uniform distribution). It follows that however we scan the table, we should hit a user='abcd' row within about 1000 rows. Sounds like it should be fast! The planner reasons the same way and chooses the index on the timestamp column.

But this is not so. Suppose your table T contains user activity logs, and user='abcd' has been on vacation for the past 3 weeks. Then we will need to read quite a lot of rows from the timestamp index (3 weeks' worth of everyone else's data) before we actually hit the row we need. You, as the DBA, know this, but the planner assumes a uniform distribution.
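A toy model of that mismatch (Python; the insert rate and the 3-week gap are assumptions for illustration, not numbers from the question):

```python
# Planner's model: matches are uniform, so a scan in timestamp order
# should hit a user = 'abcd' row after roughly 1/frequency rows.
frequency = 0.001
rows_expected = round(1 / frequency)  # ~1000 rows before the first hit

# Reality: the user has no rows for 3 weeks, so the scan must wade
# through every other user's activity for those weeks first.
rows_per_day = 200_000           # assumed table-wide insert rate
rows_actual = 21 * rows_per_day  # ~4.2 million rows to read

print(rows_expected, rows_actual)  # 1000 4200000
```

The planner's cost estimate is off by three orders of magnitude here, which is why it happily picks the timestamp index.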

So how to fix it?

You will need to nudge the planner toward the plan you want, because you have more knowledge about your data than it does.

  • Use the OFFSET 0 subquery trick:

     SELECT * FROM ( SELECT * FROM T WHERE user='abcd' OFFSET 0 ) AS s ORDER BY timestamp LIMIT 1; 

    This trick prevents the planner from inlining the subquery, so the inner query is executed as written and can use the user index.

  • Use a CTE (named subquery):

     WITH s AS ( SELECT * FROM T WHERE user='abcd' ) SELECT * FROM s ORDER BY timestamp LIMIT 1; 

    From the documentation:

    A useful property of WITH queries is that they are evaluated only once per execution of the parent query, even if they are referred to more than once by the parent query or sibling WITH queries.

  • Use count(*) in aggregate queries:

     SELECT min(session_id), count(*) -- instead of simply `min(session_id)`
       FROM T WHERE user='abcd'; 

    This is not really applicable to your case, but I wanted to mention it.

And please consider upgrading to 9.3.

P.S. There is more about row estimates in the documentation, of course.


Source: https://habr.com/ru/post/978663/
