PostgreSQL index vs seq scan on 100+ million rows

I have a problem where a query refuses to use an index because the index is not selective enough (say, 60 out of 130 million rows satisfy the condition), so the planner decides to use a seq scan instead.

The problem I'm facing is that the seq scan is actually not the best choice here. For some reason the planner gives it a very good cost estimate, but in reality a seq scan is only fast if it has been run before, so that the data is already loaded into the buffers/caches.

An index scan can be slightly slower than a seq scan if both of them are served from buffers, but that rarely happens; when both queries are cold, the index scan is still much faster (milliseconds versus seconds).
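If you want to check how much of the table is actually sitting in shared buffers at a given moment, the pg_buffercache extension can show that. A minimal sketch, modeled on the standard pg_buffercache documentation example (only the table name comes from this post):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- Count how many 8 kB buffers of identities_identity are cached right now
    SELECT count(*) AS cached_buffers
    FROM pg_buffercache b
    JOIN pg_class c
      ON b.relfilenode = pg_relation_filenode(c.oid)
     AND b.reldatabase IN (0, (SELECT oid FROM pg_database
                               WHERE datname = current_database()))
    WHERE c.relname = 'identities_identity';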

Note that the index scan works so well precisely because I use a LIMIT clause, so it can stop as soon as it has collected those few rows.

I raised the statistics target to 1000 (the default is 100) and ran VACUUM ANALYZE just in case, but it's the same story.
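For reference, that step looks roughly like this (a sketch, assuming the statistics target was raised on the email_domain column, which the post does not state explicitly):

    -- Raise the per-column statistics target from the default of 100
    ALTER TABLE identities_identity
        ALTER COLUMN email_domain SET STATISTICS 1000;

    -- Rebuild the statistics (and reclaim dead tuples) afterwards
    VACUUM ANALYZE identities_identity;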

TL;DR: seq scan vs index scan with a low-selectivity index. The planner thinks the seq scan is preferable, but it is wrong: the seq scan is only better when the data is cached, otherwise it is far worse.

Query and plans below; note that the index scan was served from buffers, while the seq scan was not fully cached.

    explain (analyze, buffers)
    select * from identities_identity where email_domain = 'live.com' limit 100

    Limit  (cost=0.00..63.50 rows=100 width=573) (actual time=75215.573..75215.640 rows=100 loops=1)
      Buffers: shared hit=75113 read=588870
      ->  Seq Scan on identities_identity  (cost=0.00..2980008.00 rows=4692733 width=573) (actual time=75215.571..75215.604 rows=100 loops=1)
            Filter: ((email_domain)::text = 'live.com'::text)
            Rows Removed by Filter: 54464136
            Buffers: shared hit=75113 read=588870
    Planning time: 0.097 ms
    Execution time: 75215.675 ms

    Limit  (cost=0.57..187.26 rows=100 width=573) (actual time=0.027..0.090 rows=100 loops=1)
      Buffers: shared hit=6
      ->  Index Scan using identities_identity_email_domain_9056bd28 on identities_identity  (cost=0.57..8760978.66 rows=4692733 width=573) (actual time=0.026..0.057 rows=100 loops=1)
            Index Cond: ((email_domain)::text = 'live.com'::text)
            Buffers: shared hit=6
    Planning time: 0.078 ms
    Execution time: 0.124 ms

UPDATE:

Table definition (email and email_domain are indexed, both with standard and varchar_pattern_ops indexes):

    CREATE TABLE public.identities_identity
    (
        id bigint NOT NULL DEFAULT nextval('identities_identity_id_seq'::regclass),
        email character varying(1000) COLLATE pg_catalog."default",
        email_domain character varying(1000) COLLATE pg_catalog."default",
        leak_id bigint NOT NULL,
        CONSTRAINT identities_identity_pkey PRIMARY KEY (id),
        CONSTRAINT identities_identity_leak_id_87e1ae4e_fk_identities_leak_id FOREIGN KEY (leak_id)
            REFERENCES public.identities_leak (id) MATCH SIMPLE
            ON UPDATE NO ACTION
            ON DELETE NO ACTION
            DEFERRABLE INITIALLY DEFERRED
    )
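The index definitions themselves are not in the post; on email_domain they would look roughly like this (the first index name appears in the plan above, the second name is a guess):

    CREATE INDEX identities_identity_email_domain_9056bd28
        ON identities_identity (email_domain);

    -- varchar_pattern_ops variant, useful for LIKE 'prefix%' searches; name is hypothetical
    CREATE INDEX identities_identity_email_domain_like
        ON identities_identity (email_domain varchar_pattern_ops);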

Column statistics (after VACUUM ANALYZE):

    attname        avg_width  n_distinct  correlation
    id             8          -1          0.999988
    email          23         -0.636853   -0.020479
    email_domain   10         3876        0.696452
    leak_id        8          1           1
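These figures presumably come from the pg_stats view; a query along these lines reproduces them:

    SELECT attname, avg_width, n_distinct, correlation
    FROM pg_stats
    WHERE schemaname = 'public'
      AND tablename = 'identities_identity';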
2 answers

Well, the solution was to physically reorder the data so that sequential scans for these particular cases would no longer be painfully slow.

Basically, run CLUSTER identities_identity USING index_name; on a column that distributes the data evenly (for example, the part of the email address before the domain), as sketched below.
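A minimal sketch of what that could look like, assuming an existing btree index on the email column (the index name here is hypothetical):

    -- Rewrites the table in index order; takes an ACCESS EXCLUSIVE lock for the duration
    CLUSTER identities_identity USING identities_identity_email_idx;

    -- Refresh planner statistics (including the correlation figures) afterwards
    ANALYZE identities_identity;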

Sequential scans now perform just fine, even with cold buffers.

That said, @Laurenz Albe's answer works well for the specific case I posted, and it is a nice trick if clustering is not an option.


You can use a sneaky trick to force an index scan:

    SELECT * FROM identities_identity
    WHERE email_domain IN ('live.com', NULL)
    ORDER BY email_domain
    LIMIT 100;

If PostgreSQL has to sort the rows, using the index will always be cheaper.

With a plain WHERE email_domain = 'live.com', PostgreSQL is smart enough to know the result doesn't need sorting, so I added a second, useless element to the IN list to trick it.


Source: https://habr.com/ru/post/1272710/

