Postgres not using index even when index scan is much better

I have a simple query joining two tables, and it is very slow. I found out that the query plan performs a sequential scan on the large email_activities table (~10M rows), while I believe a nested loop using the index would actually be faster.

I rewrote the query using a subquery, trying to force the use of an index, and then noticed something interesting. If you look at the two query plans below, you will see that when I limit the result set of the subquery to 43k rows, the plan uses the index on email_activities, while raising the limit in the subquery to 44k makes the planner switch to a seq scan on email_activities. One is clearly more efficient than the other, but Postgres doesn't seem to care.

What could cause this? Is there a configuration setting somewhere that forces a hash join when one of the sets is larger than a certain size?

 explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id")
 FROM "email_activities"
 WHERE email_recipient_id IN (
    SELECT "email_recipients"."id"
    FROM email_recipients
    WHERE "email_recipients"."email_campaign_id" = 1607
    LIMIT 43000);

                                                             QUERY PLAN
 --------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=118261.50..118261.50 rows=1 width=4) (actual time=224.556..224.556 rows=1 loops=1)
   ->  Nested Loop  (cost=3699.03..118147.99 rows=227007 width=4) (actual time=32.586..209.076 rows=40789 loops=1)
         ->  HashAggregate  (cost=3698.94..3827.94 rows=43000 width=4) (actual time=32.572..47.276 rows=43000 loops=1)
               ->  Limit  (cost=0.09..3548.44 rows=43000 width=4) (actual time=0.017..22.547 rows=43000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.017..19.168 rows=43000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
         ->  Index Only Scan using index_email_activities_on_email_recipient_id on email_activities  (cost=0.09..2.64 rows=5 width=4) (actual time=0.003..0.003 rows=1 loops=43000)
               Index Cond: (email_recipient_id = email_recipients.id)
               Heap Fetches: 40789
 Total runtime: 224.675 ms

and

 explain analyze SELECT COUNT(DISTINCT "email_activities"."email_recipient_id")
 FROM "email_activities"
 WHERE email_recipient_id IN (
    SELECT "email_recipients"."id"
    FROM email_recipients
    WHERE "email_recipients"."email_campaign_id" = 1607
    LIMIT 50000);

                                                             QUERY PLAN
 --------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=119306.25..119306.25 rows=1 width=4) (actual time=3050.612..3050.613 rows=1 loops=1)
   ->  Hash Semi Join  (cost=4451.08..119174.27 rows=263962 width=4) (actual time=1831.673..3038.683 rows=47935 loops=1)
         Hash Cond: (email_activities.email_recipient_id = email_recipients.id)
         ->  Seq Scan on email_activities  (cost=0.00..107490.96 rows=9359988 width=4) (actual time=0.003..751.988 rows=9360039 loops=1)
         ->  Hash  (cost=4276.08..4276.08 rows=50000 width=4) (actual time=34.058..34.058 rows=50000 loops=1)
               Buckets: 8192  Batches: 1  Memory Usage: 1758kB
               ->  Limit  (cost=0.09..4126.08 rows=50000 width=4) (actual time=0.016..27.302 rows=50000 loops=1)
                     ->  Index Scan using index_email_recipients_on_email_campaign_id on email_recipients  (cost=0.09..5422.47 rows=65710 width=4) (actual time=0.016..22.244 rows=50000 loops=1)
                           Index Cond: (email_campaign_id = 1607)
 Total runtime: 3050.660 ms
  • Version: PostgreSQL 9.3.10 on x86_64-unknown-linux-gnu, compiled by gcc (Ubuntu/Linaro 4.6.3-1ubuntu5) 4.6.3, 64-bit
  • email_activities: ~10M rows
  • email_recipients: ~11M rows
2 answers

Index Scan → Bitmap Index Scan → Sequential Scan

For just a few rows, it pays to run an index scan. As more rows are returned (a higher percentage of the table; also depending on data distribution, value frequencies and row width), it becomes more likely that several of those rows are found on the same data page. Then it pays to switch to a bitmap index scan. Once a large percentage of data pages has to be visited anyway, it is cheaper to run a sequential scan, filter out surplus rows and skip the overhead of indexes entirely.
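The crossover points between these scan types are governed by the planner's cost constants. As a quick sanity check, you can inspect them in your session (the values shown in the comments are the stock defaults; your setup may differ):

```sql
-- Planner cost constants that drive the index / bitmap / seq-scan crossover
SHOW seq_page_cost;         -- default 1: cost of a sequentially fetched page
SHOW random_page_cost;      -- default 4: cost of a randomly fetched page
SHOW effective_cache_size;  -- the planner's assumption about available cache
```

A `random_page_cost` that is too high for your hardware (e.g. on SSDs) makes index scans look more expensive than they really are and moves the crossover toward sequential scans.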

Postgres switches to a sequential scan because it expects to find rows=263962, which is already 3% of the whole table. (The actual count is only rows=47935, see below.)

More details in this related answer:

Beware of forcing query plans

You cannot force a particular planner method directly in Postgres, but you can make other methods prohibitively expensive for debugging purposes. See Planner Method Configuration in the manual.

SET enable_seqscan = off (as suggested in another answer) does that for sequential scans. But this is for debugging purposes in your session only. Do not use it as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. Quoting the manual:
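For debugging, you can confine the setting to a single transaction with SET LOCAL, so it cannot leak into the rest of the session; a sketch using the query from the question:

```sql
BEGIN;
SET LOCAL enable_seqscan = off;  -- discourages seq scans inside this transaction only

EXPLAIN ANALYZE
SELECT COUNT(DISTINCT "email_activities"."email_recipient_id")
FROM   "email_activities"
WHERE  email_recipient_id IN (
   SELECT "email_recipients"."id"
   FROM   email_recipients
   WHERE  "email_recipients"."email_campaign_id" = 1607
   LIMIT  50000);

ROLLBACK;  -- the setting is reverted when the transaction ends
```

This lets you compare the forced plan against the default one side by side without changing any persistent state.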

These configuration parameters provide a crude method of influencing the query plans chosen by the query optimizer. If the default plan chosen by the optimizer for a particular query is not optimal, a temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to improve the quality of the plans chosen by the optimizer include adjusting the planner cost constants (see Section 18.7.2), running ANALYZE manually, increasing the value of the default_statistics_target configuration parameter, and increasing the amount of statistics collected for specific columns using ALTER TABLE SET STATISTICS.

That last piece of advice is the one we need here.

In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id than are actually found:

estimated rows=227007 vs. actual rows=40789
estimated rows=263962 vs. actual rows=47935

If you run this query often, it will pay to make ANALYZE look at a bigger sample for more accurate statistics on this particular column. Your table is big (~10M rows), so make that:

 ALTER TABLE email_activities ALTER COLUMN email_recipient_id
    SET STATISTICS 3000;  -- max 10000, default 100

Then run ANALYZE email_activities;
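To verify the new target took effect and to see what the planner now believes about the column, you can query the system catalogs (a sketch; pg_attribute and pg_stats are standard Postgres catalogs/views):

```sql
-- Per-column statistics target (-1 means: use default_statistics_target)
SELECT attname, attstattarget
FROM   pg_attribute
WHERE  attrelid = 'email_activities'::regclass
AND    attname  = 'email_recipient_id';

-- The distinct-value estimate that feeds the planner's row estimates
SELECT n_distinct, null_frac
FROM   pg_stats
WHERE  tablename = 'email_activities'
AND    attname   = 'email_recipient_id';
```

If n_distinct is far from the true number of distinct recipients, the 5-6x estimation error above is the expected consequence.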

Last resort measure

In very rare cases, you can resort to forcing an index with SET LOCAL enable_seqscan = off in a separate transaction or in a function with its own environment. Like this:

 CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
   RETURNS bigint AS
 $func$
    SELECT COUNT(DISTINCT a.email_recipient_id)
    FROM   email_activities a
    WHERE  a.email_recipient_id IN (
       SELECT id
       FROM   email_recipients
       WHERE  email_campaign_id = $1
       LIMIT  $2)  -- or consider the query below
 $func$  LANGUAGE sql VOLATILE COST 100000 SET enable_seqscan = off;

The setting applies to the local scope of the function only.

Warning: this is just a proof of concept. Even this much less radical manual intervention can bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings, everything changes over time. You are going to upgrade to a new Postgres version eventually. The query plan you force now may become a very bad idea later.

And typically this is just a workaround for a problem with your setup. Better find and fix that.

Alternative request

Essential information is missing from the question, but this equivalent query is probably faster and more likely to use the index on (email_recipient_id) - increasingly so for a bigger LIMIT.

 SELECT COUNT(*) AS ct
 FROM  (
    SELECT id
    FROM   email_recipients
    WHERE  email_campaign_id = 1607
    LIMIT  43000
    ) r
 WHERE  EXISTS (
    SELECT 1
    FROM   email_activities
    WHERE  email_recipient_id = r.id);

A sequential scan can be more efficient even when an index exists. In this case, Postgres seems to estimate things quite wrongly. An ANALYZE <TABLE> on all related tables can help in such cases. If it doesn't, you can set the variable enable_seqscan to OFF to force Postgres to use an index whenever technically possible, at the cost that sometimes an index scan will be used when a sequential scan would perform better.
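Concretely, for the tables in the question that would amount to refreshing the statistics and re-checking the estimates (a sketch; the query is the one from the question):

```sql
-- Refresh planner statistics for both tables involved in the join
ANALYZE email_activities;
ANALYZE email_recipients;

-- Then compare estimated vs. actual row counts again
EXPLAIN ANALYZE
SELECT COUNT(DISTINCT email_recipient_id)
FROM   email_activities
WHERE  email_recipient_id IN (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  50000);
```

If the estimated rows in the new plan land close to the actual rows, the planner's choice between the index scan and the seq scan should improve on its own.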


Source: https://habr.com/ru/post/1239520/

