The PostgreSQL Multi-Column index is combined with comparison operators ("<" and ">")
I am trying to use the btree multi-column index in PostgreSQL to make an annoying join between two tables.
Table "revision_main" Column | Type | Modifiers ----------------+------------------------+----------- revision_id | integer | page_id | integer | Indexes: "revision_main_pkey" UNIQUE, btree (revision_id) "revision_main_cluster_idx" btree (page_id, "timestamp") CLUSTER This table contains revisions (~ 300 million rows) per page on the wiki. My table has more columns, but I dropped them for this example because they should not matter.
Table "revert" Column | Type | Modifiers --------------------+---------+----------- page_id | integer | revision_id | integer | reverted_to | integer | Indexes: "revert_page_between_idx" btree (page_id, reverted_to, revision_id) CLUSTER This table contains version fixes (~ 22 million rows). If revisions were canceled, then version_id will have a row in the revision_main table, and its revision_id will be between reverted_to and revision_id, and it will also have the same page_id. (See http://en.wikipedia.org/wiki/Wikipedia:Revert if you are interested.
Combining these two tables to get fixed changes seems easy. Here is what I came up with:
explain SELECT r.revision_id, rvt.revision_id FROM revision_main r INNER JOIN revert rvt ON r.page_id = rvt.page_id AND r.revision_id > rvt.reverted_to AND r.revision_id < rvt.revision_id; QUERY PLAN ---------------------------------------------------------------------------------------------------- Merge Join (cost=4202878.87..15927491478.57 rows=88418194298 width=8) Merge Cond: (r.page_id = rvt.page_id) Join Filter: ((r.revision_id > rvt.reverted_to) AND (r.revision_id < rvt.revision_id)) -> Index Scan using revision_main_page_id_idx on revision_main r (cost=0.00..9740790.61 rows=223163392 width=8) -> Materialize (cost=4201592.06..4536465.21 rows=26789852 width=12) -> Sort (cost=4201592.06..4268566.69 rows=26789852 width=12) Sort Key: rvt.page_id -> Seq Scan on revert rvt (cost=0.00..438534.52 rows=26789852 width=12) Although the clustered index upon return must be a Btree index (and thus support comparison operators such as "<" and ">"), the query optimizer does not use the index to combine, and the "explanation" predicts the total cost of more 15 billion (can be done next year).
Are comparison operators mapped to multi-column (btree) indices? Am I just doing it wrong?
It seems that the optimizer knows his job better than you.
If you choose more than a small part of the table (what proportion depends on the equipment, let them say 5%), then itβs faster to select and order the whole table than to use the index. If you just highlighted a few lines, then it should use an index. Thus, it gives you the right query plan for your data.
As for the total cost, these numbers are all BS and are useful only when comparing with each other within the same query. (The total costs caused by two very similar requests can be on very different scales.) The lead time and cost of the request are largely unrelated.
Your query (SQL-based) looks like it should read the entire cancellation table and find the corresponding revision lines for each row in the cancellation table.
Since the entire reverse table needs to be read, its sequential scan is appropriate. It seems to be expecting about the correct number of rows.
Each row returned will then correspond to the number of revisions that, in his opinion, are best done by scanning the index and merging the union. It is estimated that, on average, each row returned will correspond to approximately 3,300 revisions, resulting in 88 billion rows.
I do not know how to quickly select 88 billion rows.
To get a more accurate estimate, you'll need a way to convince PostgreSQL that less than 3,300 fixes are available with each return.
You say that after repeated revisions you indicate that each revision should appear only once, even if it is included in several returns.
So try using EXISTS (subquery) instead of INNER JOIN
This will not give you repeated changes:
EXPLAIN SELECT r.revision_id FROM revision_main r WHERE EXISTS (SELECT 1 FROM revert rvt WHERE r.page_id = rvt.page_id AND r.revision_id > rvt.reverted_to AND r.revision_id < rvt.revision_id);