I have a problem where a query refuses to use an index because the index is not selective enough (say, 60 out of 130 million rows satisfy the condition), so the planner falls back to a seq scan.
The problem is that the seq scan is actually not the best choice here. It only appears to get good results because it is fast when the same query has been run before, which leaves the table pages sitting in the buffers/cache.
An index scan can be slightly slower than a seq scan when both are served entirely from the buffers, but that rarely happens; when both queries run cold, the index scan is still much faster (milliseconds versus seconds).
Note that the index scan works especially well here because I use a LIMIT clause, so it can stop after collecting just those few rows.
I raised the statistics target to 1000 (the default is 100) and ran VACUUM ANALYZE just in case, but it's the same story.
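For reference, this is roughly what I ran (assuming a per-column statistics target on the filtered column; the exact target level is the only thing I changed):

```sql
-- Raise the sampling target for the column used in the filter, then refresh stats
ALTER TABLE identities_identity ALTER COLUMN email_domain SET STATISTICS 1000;
VACUUM ANALYZE identities_identity;
```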
TL;DR: seq scan vs. index scan with a low-selectivity index. The planner prefers the seq scan, but it is wrong: the seq scan is only better when the data is cached, and much worse otherwise.
Query and plans below; note that the index scan was served from the buffers, while the seq scan had to read most of its pages from disk.
```sql
explain (analyze, buffers)
select * from identities_identity where email_domain = 'live.com' limit 100
```

Seq scan plan:

```
Limit  (cost=0.00..63.50 rows=100 width=573) (actual time=75215.573..75215.640 rows=100 loops=1)
  Buffers: shared hit=75113 read=588870
  ->  Seq Scan on identities_identity  (cost=0.00..2980008.00 rows=4692733 width=573) (actual time=75215.571..75215.604 rows=100 loops=1)
        Filter: ((email_domain)::text = 'live.com'::text)
        Rows Removed by Filter: 54464136
        Buffers: shared hit=75113 read=588870
Planning time: 0.097 ms
Execution time: 75215.675 ms
```

Index scan plan:

```
Limit  (cost=0.57..187.26 rows=100 width=573) (actual time=0.027..0.090 rows=100 loops=1)
  Buffers: shared hit=6
  ->  Index Scan using identities_identity_email_domain_9056bd28 on identities_identity  (cost=0.57..8760978.66 rows=4692733 width=573) (actual time=0.026..0.057 rows=100 loops=1)
        Index Cond: ((email_domain)::text = 'live.com'::text)
        Buffers: shared hit=6
Planning time: 0.078 ms
Execution time: 0.124 ms
```
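In case it helps anyone reproduce the comparison: the index-scan plan can be obtained in a session by disabling seq scans with the standard `enable_seqscan` planner knob (a debugging switch, not a production fix):

```sql
SET enable_seqscan = off;  -- makes the planner cost seq scans prohibitively high
explain (analyze, buffers)
select * from identities_identity where email_domain = 'live.com' limit 100;
RESET enable_seqscan;      -- restore the default for the rest of the session
```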
UPDATE:
Table definition (email and email_domain are both indexed, each with both the default opclass and varchar_pattern_ops):
```sql
CREATE TABLE public.identities_identity
(
    id bigint NOT NULL DEFAULT nextval('identities_identity_id_seq'::regclass),
    email character varying(1000) COLLATE pg_catalog."default",
    email_domain character varying(1000) COLLATE pg_catalog."default",
    leak_id bigint NOT NULL,
    CONSTRAINT identities_identity_pkey PRIMARY KEY (id),
    CONSTRAINT identities_identity_leak_id_87e1ae4e_fk_identities_leak_id FOREIGN KEY (leak_id)
        REFERENCES public.identities_leak (id) MATCH SIMPLE
        ON UPDATE NO ACTION
        ON DELETE NO ACTION
        DEFERRABLE INITIALLY DEFERRED
)
```
Column statistics (after VACUUM ANALYZE):
```
attname        avg_width  n_distinct  correlation
id             8          -1          0.999988
email          23         -0.636853   -0.020479
email_domain   10         3876        0.696452
leak_id        8          1           1
```
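These figures come from the `pg_stats` system view; something like the following reproduces them (the column names are the standard `pg_stats` ones):

```sql
-- Per-column statistics the planner uses for this table
SELECT attname, avg_width, n_distinct, correlation
FROM pg_stats
WHERE schemaname = 'public'
  AND tablename = 'identities_identity';
```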