Does a Postgres anti-join need a table scan?

I need an anti-join (NOT EXISTS (SELECT ... FROM table ...) / LEFT JOIN table ... WHERE table.id IS NULL) on the same table. I actually have an index that could serve the NOT EXISTS check, but the query planner chooses to use a bitmap heap scan.
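For reference, the two anti-join forms mentioned above can be sketched as follows. This is only an illustration: `my_table` and `parent_id` are placeholder names, not the real schema, and the two forms are equivalent only under the assumption that `id` is never NULL (for example, a primary key).

```sql
-- Form 1: NOT EXISTS (placeholder table and column names)
SELECT t1.id
FROM my_table t1
WHERE NOT EXISTS (
    SELECT 1
    FROM my_table t2
    WHERE t2.parent_id = t1.id
);

-- Form 2: LEFT JOIN ... IS NULL
-- Assumes t2.id is NOT NULL, so a NULL here can only mean "no matching row".
SELECT t1.id
FROM my_table t1
LEFT JOIN my_table t2 ON t2.parent_id = t1.id
WHERE t2.id IS NULL;
```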

The table contains 100 million rows, so doing a heap scan is prohibitively slow ...

It would be really fast if Postgres could do the comparison using the indexes alone. Does Postgres have to visit the table for this anti-join?

I know the table has to be visited at some point to satisfy MVCC, but why so early? Can NOT EXISTS only be resolved by visiting the table, because it might miss something otherwise?

+4
2 answers

You'll need to provide version details and, as jmz says, EXPLAIN ANALYZE output to get any useful advice.

Franz, don't wonder whether it's possible; test it and find out.

This is v9.0:

    CREATE TABLE tl (i int, t text);
    CREATE TABLE tr (i int, t text);
    INSERT INTO tl SELECT s, 'text ' || s FROM generate_series(1,999999) s;
    INSERT INTO tr SELECT s, 'text ' || s FROM generate_series(1,999999) s WHERE s % 3 = 0;
    ALTER TABLE tl ADD PRIMARY KEY (i);
    CREATE INDEX tr_i_idx ON tr (i);
    ANALYSE;

    -- t is qualified as tl.t because both tables have a column named t
    EXPLAIN ANALYSE SELECT i, tl.t FROM tl LEFT JOIN tr USING (i) WHERE tr.i IS NULL;

                                                             QUERY PLAN
    -----------------------------------------------------------------------------------------------------------------------------
     Merge Anti Join  (cost=0.95..45611.86 rows=666666 width=15) (actual time=0.040..4011.970 rows=666666 loops=1)
       Merge Cond: (tl.i = tr.i)
       ->  Index Scan using tl_pkey on tl  (cost=0.00..29201.32 rows=999999 width=15) (actual time=0.017..1356.996 rows=999999 lo
       ->  Index Scan using tr_i_idx on tr  (cost=0.00..9745.27 rows=333333 width=4) (actual time=0.015..439.087 rows=333333 loop
     Total runtime: 4602.224 ms

What you see will depend on your version and on the statistics the planner sees.

+7

My (simplified) query:

 SELECT a.id FROM a LEFT JOIN b ON b.id = a.id WHERE b.id IS NULL ORDER BY id; 

The query plan for it looks like this:

                                   QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------
     Merge Anti Join  (cost=0.57..3831.88 rows=128092 width=8)
       Merge Cond: (a.id = b.id)
       ->  Index Only Scan using a_pkey on a  (cost=0.42..3399.70 rows=130352 width=8)
       ->  Index Only Scan using b_pkey on b  (cost=0.15..78.06 rows=2260 width=8)
    (4 rows)

However, PostgreSQL 9.5.9 sometimes switches to a sequential scan when the planner thinks it will be cheaper (see "Why does PostgreSQL perform sequential scan on indexed column?"). In my case, this made things worse.

                                   QUERY PLAN
    -------------------------------------------------------------------------------------------------------------------------
     Merge Anti Join  (cost=405448.22..39405858.08 rows=1365191502 width=8)
       Merge Cond: (a.id = b.id)
       ->  Index Only Scan using a_pkey on a  (cost=0.58..35528317.86 rows=1368180352 width=8)
       ->  Materialize  (cost=405447.64..420391.89 rows=2988850 width=8)
             ->  Sort  (cost=405447.64..412919.76 rows=2988850 width=8)
                   Sort Key: b.id
                   ->  Seq Scan on b  (cost=0.00..43113.50 rows=2988850 width=8)
    (7 rows)

My solution (hack) was to prevent sequential scans:

 set enable_seqscan to off; 
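If the setting should affect only this one query and not the rest of the session, one option is to scope it to a transaction with `SET LOCAL`. A minimal sketch, reusing the simplified query from above:

```sql
BEGIN;
-- SET LOCAL lasts only until the end of the current transaction,
-- so other queries in the session keep the default planner behaviour.
SET LOCAL enable_seqscan = off;

SELECT a.id
FROM a
LEFT JOIN b ON b.id = a.id
WHERE b.id IS NULL
ORDER BY id;
COMMIT;
```

Note that `enable_seqscan = off` does not forbid sequential scans outright; it only makes the planner treat them as very expensive.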

The PostgreSQL documentation says the better way to do this is to adjust the planner cost constants such as seq_page_cost, which can be overridden per tablespace with ALTER TABLESPACE. This might help when using ORDER BY on indexed columns, but I'm not sure. https://www.postgresql.org/docs/9.1/static/runtime-config-query.html
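I have not tried it, but a per-tablespace override would look roughly like the sketch below. The tablespace name `fast_ssd` and the cost values are hypothetical and only illustrate the syntax; suitable values depend on the hardware.

```sql
-- Hypothetical tablespace name and illustrative values only.
-- A random_page_cost closer to seq_page_cost makes index scans look cheaper
-- to the planner for tables and indexes stored in this tablespace.
ALTER TABLESPACE fast_ssd SET (seq_page_cost = 1.0, random_page_cost = 1.1);

-- Revert to the server-wide settings.
ALTER TABLESPACE fast_ssd RESET (seq_page_cost, random_page_cost);
```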

0

Source: https://habr.com/ru/post/1343340/

