Index Scan → Bitmap Index Scan → Sequential Scan
For a few rows, it pays to run an index scan. As more rows are returned (a higher percentage of the table, also depending on data distribution, value frequencies and row width), it becomes more likely that several of them are found on the same data page. Then it pays to switch to a bitmap index scan. Once a large percentage of data pages has to be visited anyway, it's cheaper to run a sequential scan, filter out surplus rows and skip the overhead for indexes entirely.
Postgres switches to a sequential scan when it expects to find rows=263962, which is already 3% of the whole table. (While only rows=47935 are actually found, see below.)
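You can inspect estimated vs. actual row counts yourself. A minimal sketch, using the table names and values from the question:

```sql
-- Estimated rows show up as "rows=..." in each plan line,
-- actual counts in the "(actual time=... rows=...)" part.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(DISTINCT a.email_recipient_id)
FROM   email_activities a
WHERE  a.email_recipient_id IN (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  43000);
```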
More in this related answer:
Beware of forcing query plans
You cannot force a particular planner method directly in Postgres, but you can make other methods prohibitively expensive for debugging purposes. See Planner Method Configuration in the manual.
SET enable_seqscan = off (as suggested in another answer) does that for sequential scans. But this is intended for debugging purposes in your session only. Do not use it as a general setting in production unless you know exactly what you are doing. It can force ridiculous query plans. Quoting the manual:
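Session-level use for debugging looks like this (a sketch; the setting sticks for the rest of the session unless reset):

```sql
SET enable_seqscan = off;  -- debugging only: makes seq scans prohibitively expensive
-- ... run EXPLAIN (ANALYZE) on the query in question here ...
RESET enable_seqscan;      -- restore the default for the rest of the session
```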
These configuration parameters provide a crude method of influencing the query plans chosen by the query optimizer. If the default plan chosen by the optimizer for a particular query is not optimal, a temporary solution is to use one of these configuration parameters to force the optimizer to choose a different plan. Better ways to improve the quality of the plans chosen by the optimizer include adjusting the planner cost constants (see Section 18.7.2), running ANALYZE manually, increasing the value of the default_statistics_target configuration parameter, and increasing the amount of statistics collected for specific columns using ALTER TABLE SET STATISTICS.
That already covers most of the advice you need.
In this particular case, Postgres expects 5-6 times more hits on email_activities.email_recipient_id than are actually found:
estimated rows=227007 vs. actual ... rows=40789
estimated rows=263962 vs. actual ... rows=47935
If you run this query often, it will pay to have ANALYZE look at a bigger sample for more accurate statistics on the particular column. Your table is big (~ 10M rows), so make that:
ALTER TABLE email_activities ALTER COLUMN email_recipient_id SET STATISTICS 3000;
Then run: ANALYZE email_activities;
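To verify that the per-column statistics target was actually changed, you can check the system catalog (a sketch; pg_attribute.attstattarget holds the per-column override):

```sql
-- Returns 3000 after the ALTER TABLE above took effect
SELECT attstattarget
FROM   pg_attribute
WHERE  attrelid = 'email_activities'::regclass
AND    attname  = 'email_recipient_id';
```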
Last resort measure
In very rare cases, you may resort to forcing an index scan with SET LOCAL enable_seqscan = off in a separate transaction, or in a function with its own environment. Like:
CREATE OR REPLACE FUNCTION f_count_dist_recipients(_email_campaign_id int, _limit int)
  RETURNS bigint AS
$func$
   SELECT count(DISTINCT a.email_recipient_id)
   FROM   email_activities a
   WHERE  a.email_recipient_id IN (
      SELECT id
      FROM   email_recipients
      WHERE  email_campaign_id = $1
      LIMIT  $2)
$func$  LANGUAGE sql SET enable_seqscan = off;
The setting only applies to the local scope of the function.
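Calling the function works like any other query; enable_seqscan is only off while the function body executes (values taken from the question):

```sql
SELECT f_count_dist_recipients(1607, 43000);
```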
Warning: This is just a proof of concept. Even this much less radical manual intervention can bite you in the long run. Cardinalities, value frequencies, your schema, global Postgres settings: everything changes over time. You are going to upgrade to a new Postgres version eventually. The query plan you force now may become a very bad idea later.
And typically this is just a workaround for a problem with your setup. Better to find and fix that.
Alternative query
Essential information is missing in the question, but this equivalent query is probably faster and more likely to use an index on (email_recipient_id), increasingly so for a bigger LIMIT.
SELECT COUNT(*) AS ct
FROM  (
   SELECT id
   FROM   email_recipients
   WHERE  email_campaign_id = 1607
   LIMIT  43000
   ) r
WHERE EXISTS (
   SELECT 1
   FROM   email_activities
   WHERE  email_recipient_id = r.id);
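For the EXISTS probe to use an index at all, an index on the referenced column has to exist. Assuming it is not already in place (the index name is chosen by Postgres here):

```sql
-- One index lookup per candidate id; the EXISTS form can stop
-- at the first match instead of collecting all duplicates.
CREATE INDEX ON email_activities (email_recipient_id);
```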