PostgreSQL partitioning with pivot table - partition constraint not used in query plan

I have a large table in PostgreSQL 9.2, which I partitioned as described in the manual. Almost! My real partition key is not in the partitioned table itself, but in a joined table, like this (simplified):

    -- millions to tens of millions of rows
    CREATE TABLE data (
        slice_id integer NOT NULL,
        point_id integer NOT NULL,
        -- ... data columns ...
        CONSTRAINT pk_data PRIMARY KEY (slice_id, point_id),
        CONSTRAINT fk_data_slice FOREIGN KEY (slice_id) REFERENCES slice (id),
        CONSTRAINT fk_data_point FOREIGN KEY (point_id) REFERENCES point (id)
    );

    -- hundreds to thousands of rows
    CREATE TABLE slice (
        id serial NOT NULL,
        partition_date timestamp without time zone NOT NULL,
        other_date timestamp without time zone NOT NULL,
        int_key integer NOT NULL,
        CONSTRAINT pk_slice PRIMARY KEY (id)
    );

    -- about 40,000 rows
    CREATE TABLE point (
        -- ... similar to "slice" ...
    );

The partitioned table ( data ) contains a row for each combination of point and slice , identified by a composite key. I want to partition it on only one of the key columns, partition_date , which lives in slice . Of course, the CHECK constraints on my child tables cannot reference that column directly, so instead each constraint covers the range of slice.id values corresponding to that partition_date , for example:

 ALTER TABLE data_part_123 ADD CONSTRAINT ck_data_part_123 CHECK (slice_id >= 1234 AND slice_id <= 1278); 
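For context, here is a minimal sketch of how one such child table might be declared under 9.2-style inheritance partitioning (the table name and id range are taken from the example above; the INSERT routing trigger is omitted):

```sql
-- hypothetical child table for one partition_date; the slice_id
-- range must be maintained by hand as slices are added
CREATE TABLE data_part_123 (
    CONSTRAINT ck_data_part_123 CHECK (slice_id >= 1234 AND slice_id <= 1278)
) INHERITS (data);
```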

All this works great for inserting data. However, queries do not make use of these CHECK constraints. For instance:

    SELECT *
    FROM data d
    JOIN slice s ON d.slice_id = s.id
    WHERE s.partition_date = '2013-07-23';

The query plan shows that this still scans all child tables. I tried rewriting the query in several ways, including with a CTE and a subselect, but that did not help.
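As a sanity check (my addition, not from the original post): constraint exclusion in 9.2 only fires when the restricting value is a constant known at plan time, which is why a literal range is pruned but the same range arriving through a join is not:

```sql
-- assumes the setup above; constraint_exclusion = partition is the default
SET constraint_exclusion = partition;

-- a literal range can be proven against the CHECK constraint,
-- so only data_part_123 should be scanned:
EXPLAIN SELECT * FROM data WHERE slice_id BETWEEN 1234 AND 1278;

-- the same range arriving via a join is unknown at plan time,
-- so every child table is scanned:
EXPLAIN SELECT * FROM data d JOIN slice s ON d.slice_id = s.id
WHERE s.partition_date = '2013-07-23';
```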

Is there a way to make the planner "understand" my partitioning scheme? I really don't want to duplicate the partition key millions of times in the data table.

The query plan is as follows:

    Aggregate  (cost=539243.88..539243.89 rows=1 width=0)
      ->  Hash Join  (cost=8.88..510714.02 rows=11411945 width=0)
            Hash Cond: (d.slice_id = s.id)
            ->  Append  (cost=0.00..322667.41 rows=19711542 width=4)
                  ->  Seq Scan on data d  (cost=0.00..0.00 rows=1 width=4)
                  ->  Seq Scan on data_part_123 d  (cost=0.00..135860.10 rows=8299610 width=4)
                  ->  Seq Scan on data_part_456 d  (cost=0.00..186807.31 rows=11411931 width=4)
            ->  Hash  (cost=7.09..7.09 rows=143 width=4)
                  ->  Seq Scan on slice s  (cost=0.00..7.09 rows=143 width=4)
                        Filter: (partition_date = '2013-07-23 00:00:00'::timestamp without time zone)
2 answers

The only way to achieve this is to use dynamic SQL:

    create function select_from_data(p_date date)
    returns setof data as
    $function$
    declare
        min_slice_id integer;
        max_slice_id integer;
    begin
        select min(id), max(id)
          into min_slice_id, max_slice_id
          from slice
         where partition_date = p_date;

        return query execute $dynamic$
            select * from data where slice_id between $1 and $2
        $dynamic$ using min_slice_id, max_slice_id;
    end;
    $function$ language plpgsql;

This builds the query with the appropriate slice range for the given date and plans it at run time, when the planner has the information it needs to exclude the irrelevant partitions.
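Usage would look like an ordinary set-returning function call; because EXECUTE re-plans its statement with the actual parameter values, constraint exclusion can prune the child tables:

```sql
-- the slice_id range is looked up first, then the dynamic query is
-- planned with that range available, allowing partition pruning
SELECT * FROM select_from_data('2013-07-23');
```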

To make the function more general without losing the planner's ability to use run-time information, use the (column = $n or $n is null) construct in the filter:

    create function select_from_data (
        p_date date,
        value_1 integer default null,
        value_2 integer default null
    )
    returns setof data as
    $function$
    declare
        min_slice_id integer;
        max_slice_id integer;
    begin
        select min(id), max(id)
          into min_slice_id, max_slice_id
          from slice
         where partition_date = p_date;

        return query execute $dynamic$
            select * from data
            where slice_id between $1 and $2
              and (some_col = $3 or $3 is null)
              and (another_col = $4 or $4 is null)
        $dynamic$ using min_slice_id, max_slice_id, value_1, value_2;
    end;
    $function$ language plpgsql;

Now, if a parameter is passed as null , its filter does not restrict the query at all.
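Hypothetical calls (assuming some_col and another_col exist on data ): omitted parameters default to null and their filters drop away:

```sql
SELECT * FROM select_from_data('2013-07-23');           -- date only
SELECT * FROM select_from_data('2013-07-23', 42);       -- adds some_col = 42
SELECT * FROM select_from_data('2013-07-23', 42, 7);    -- adds another_col = 7
```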


This scheme just won't work. constraint_exclusion is simple and dumb. It must be able to prove, by examining the query at planning time, that the query cannot touch certain partitions, in order to exclude them.

Partition exclusion during query execution is not currently supported. There are many opportunities to improve Pg's rudimentary partitioning support, and run-time constraint exclusion is just one of the areas that could use work.

Your application must be aware of the partitions and their constraints, and will have to explicitly query only the required partition(s), or a UNION of them.
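For example (my sketch, reusing the partition name from the question), the application could target the known child table directly once it has resolved the date to a partition:

```sql
-- ONLY restricts the scan to the named table, skipping its
-- inheritance children
SELECT *
FROM ONLY data_part_123 d
JOIN slice s ON d.slice_id = s.id
WHERE s.partition_date = '2013-07-23';
```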

In this case, I'm not sure how PostgreSQL could even do what you want. I think you want it to propagate the constraint through the composite key across the join: since the query specifies s.partition_date = '2013-07-23' , and looking up all the slice ids with that partition_date finds them in the range slice_id >= 1234 AND slice_id <= 1278 , only the data_part_123 partition should need to be scanned.

The problem is that at planning time, PostgreSQL has absolutely no idea that s.partition_date = '2013-07-23' corresponds to a particular range of slice ids. It could perhaps determine this from correlation statistics, if it kept them, but table statistics are only approximations, not the proof required for constraint exclusion.

I suspect you will have to denormalize a bit, duplicating slice.partition_date into each row of data , if you want to partition on it. You can either try to keep the copies in sync yourself, or (what I would do) create a UNIQUE constraint on slice(id, partition_date) and then add a FOREIGN KEY from the data partitions referencing it, guaranteeing they cannot get out of sync at the cost of some extra index maintenance and insert overhead.
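A sketch of that denormalization, using the column names from the question (DDL only; backfilling existing rows is omitted):

```sql
-- make (id, partition_date) referenceable as a unit
ALTER TABLE slice
    ADD CONSTRAINT uq_slice_id_date UNIQUE (id, partition_date);

-- duplicate the partition key into data ...
ALTER TABLE data
    ADD COLUMN partition_date timestamp without time zone NOT NULL;

-- ... and guarantee it can never disagree with slice
ALTER TABLE data
    ADD CONSTRAINT fk_data_slice_date
    FOREIGN KEY (slice_id, partition_date)
    REFERENCES slice (id, partition_date);

-- child tables can then carry a directly provable constraint:
--   CHECK (partition_date = '2013-07-23')
```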


Source: https://habr.com/ru/post/1492898/

