PostgreSQL: sorting a large table with an index

I'm currently working on a complex sorting problem in PostgreSQL 9.2. You can find the (simplified) source code used in this question here: http://sqlfiddle.com/#!12/9857e/11

I have a huge table (over 20 million rows) containing many columns of different types.

    CREATE TABLE data_table (
        id       bigserial PRIMARY KEY,
        column_a character(1),
        column_b integer
        -- ~100 more columns
    );

Suppose I want to sort this table over two columns (ASC). But I don't want to do this simply with ORDER BY, because later I may need to insert rows into the sorted output, and the user probably only wants to see 100 rows of the sorted output at a time.

To achieve these goals, I do the following:

    CREATE TABLE meta_table (
        id      bigserial PRIMARY KEY,
        id_data bigint NOT NULL  -- refers to data_table
    );

    -- Function to get column_a of the current row
    CREATE OR REPLACE FUNCTION get_column_a(bigint) RETURNS character AS
        'SELECT column_a FROM data_table WHERE id = $1'
    LANGUAGE sql IMMUTABLE STRICT;

    -- Function to get column_b of the current row
    CREATE OR REPLACE FUNCTION get_column_b(bigint) RETURNS integer AS
        'SELECT column_b FROM data_table WHERE id = $1'
    LANGUAGE sql IMMUTABLE STRICT;

    -- Create an index on the expressions:
    CREATE INDEX meta_sort_index ON meta_table
        USING btree (get_column_a(id_data), get_column_b(id_data), id_data);

And then I copy the data_table identifier to meta_table:

    INSERT INTO meta_table (id_data)
    SELECT id FROM data_table;

Later I can add extra rows to the table with a similarly simple INSERT.
To get rows 900000 - 900099 (100 rows), I can now use:

    SELECT get_column_a(id_data), get_column_b(id_data), id_data
    FROM meta_table
    ORDER BY 1, 2, 3
    OFFSET 900000 LIMIT 100;

(With an extra INNER JOIN on data_table if I want all the data.)
Resulting Plan:

    Limit  (cost=498956.59..499012.03 rows=100 width=8)
      ->  Index Only Scan using meta_sort_index on meta_table
            (cost=0.00..554396.21 rows=1000000 width=8)

This is a pretty effective plan (an index-only scan, which is only available since Postgres 9.2). But what if I want to get rows 20'000'000 - 20'000'099 (100 rows)? Same plan, much longer execution time. To improve OFFSET performance (see "Improve OFFSET performance in PostgreSQL"), I can do the following (suppose I store the key of every 100'000th row in a separate table):

    SELECT get_column_a(id_data), get_column_b(id_data), id_data
    FROM meta_table
    WHERE (get_column_a(id_data), get_column_b(id_data), id_data)
          >= (get_column_a(587857), get_column_b(587857), 587857)
    ORDER BY 1, 2, 3
    LIMIT 100;

It works a lot faster. The resulting plan:

    Limit  (cost=0.51..61.13 rows=100 width=8)
      ->  Index Only Scan using meta_sort_index on meta_table
            (cost=0.51..193379.65 rows=318954 width=8)
            Index Cond: (ROW((get_column_a(id_data)), (get_column_b(id_data)), id_data)
                         >= ROW('Z'::bpchar, 27857, 587857))

So far, everything works fine, and Postgres is doing great!

Suppose I want to change the sort order of the second column to DESC.
Then I would have to change my WHERE clause, because the row-wise >= operator compares all columns ascending. The same query as above (ASC ordering) can also be written as:

    SELECT get_column_a(id_data), get_column_b(id_data), id_data
    FROM meta_table
    WHERE (get_column_a(id_data) > get_column_a(587857))
       OR (get_column_a(id_data) = get_column_a(587857)
           AND ((get_column_b(id_data) > get_column_b(587857))
                OR ((get_column_b(id_data) = get_column_b(587857))
                    AND (id_data >= 587857))))
    ORDER BY 1, 2, 3
    LIMIT 100;

Now the plan changes and the query becomes slow:

    Limit  (cost=0.00..1095.94 rows=100 width=8)
      ->  Index Only Scan using meta_sort_index on meta_table
            (cost=0.00..1117877.41 rows=102002 width=8)
            Filter: (((get_column_a(id_data)) > 'Z'::bpchar)
                     OR (((get_column_a(id_data)) = 'Z'::bpchar)
                         AND (((get_column_b(id_data)) > 27857)
                              OR (((get_column_b(id_data)) = 27857)
                                  AND (id_data >= 587857)))))

How can I get the efficient old plan back with DESC ordering?
Do you have any better ideas on how to solve this problem?

(I have already tried declaring my own type with my own operator classes, but this is too slow.)

1 answer

You need to rethink your approach. Where to begin? This is a textbook example of the performance limitations of the function-based approach you are taking to SQL. The functions are opaque to the planner, and you pay for two separate lookups on data_table for every row retrieved, because the function plans cannot be inlined into the main query.

Much worse, you are indexing one table based on data in another. This can work for append-only workloads (inserts allowed, but no updates), but it will break if updates are applied to data_table. If the data in data_table changes, the index will give you incorrect results.

In cases like these, you are almost always better off writing the join explicitly and letting the planner figure out the best way to retrieve the data.
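A minimal sketch of that rewrite, using the schema from the question: instead of hiding the lookups inside functions, join the two tables directly and put the index on data_table itself, where the planner can reason about it. (The index name here is an assumption for illustration.)

```sql
-- Index directly on data_table, covering the sort columns and the key.
CREATE INDEX data_sort_index ON data_table (column_a, column_b, id);

-- Explicit join instead of per-row function calls:
SELECT d.column_a, d.column_b, d.id
FROM meta_table m
JOIN data_table d ON d.id = m.id_data
ORDER BY d.column_a, d.column_b, d.id
LIMIT 100;
```

Because the index now lives on the table that actually holds the data, it stays correct under updates to data_table, and the planner can choose an index scan without calling any functions per row.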

Now, your problem is that your index becomes much less useful (and the query far more disk-I/O-intensive) when you change the order of the second column. If, on the other hand, you had suitable indexes directly on data_table and an explicit join, PostgreSQL could handle this much more easily.
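For the DESC case specifically, PostgreSQL also lets you declare a per-column sort direction in the index definition itself, so a mixed-direction ORDER BY can still be served by an index scan. A hedged sketch, reusing the column names and cursor values from the question (the index name is made up for illustration):

```sql
-- An index whose second column is stored descending:
CREATE INDEX data_sort_a_asc_b_desc
    ON data_table (column_a ASC, column_b DESC, id ASC);

-- Keyset pagination with mixed directions cannot use a single row
-- comparison (ROW(...) >= ROW(...) assumes one uniform direction),
-- so the boundary is spelled out column by column; note the flipped
-- comparison (<) on the descending column:
SELECT column_a, column_b, id
FROM data_table
WHERE column_a > 'Z'
   OR (column_a = 'Z'
       AND (column_b < 27857
            OR (column_b = 27857 AND id > 587857)))
ORDER BY column_a ASC, column_b DESC, id ASC
LIMIT 100;
```

With the matching (ASC, DESC, ASC) index in place, the planner can satisfy both the ORDER BY and the per-column boundary conditions with an index scan instead of the slow filter from the question.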


Source: https://habr.com/ru/post/1492734/

