CTE scanning is much slower than it should be

The following query processes about 4 million rows. The first two CTEs complete in about an hour. The final statement, however, has been running for over 15 hours.

WITH parsed AS (
    SELECT name, array(...) description
    FROM import
), counts AS (
    SELECT unnest(description) token, count(*)
    FROM parsed
    GROUP BY 1
)
INSERT INTO table (name, description)
SELECT name,
       ARRAY(
           SELECT ROW(token, count)::a
           FROM (
               SELECT token, (SELECT count FROM counts WHERE a.token = counts.token)
               FROM UNNEST(description) a(token)
           ) _
       )::a[] description
FROM parsed;

                                                           QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------
 Insert on table  (cost=55100824.40..162597717038.41 rows=3611956 width=96)
   CTE parsed
     ->  Seq Scan on import  (cost=0.00..51425557.67 rows=3611956 width=787)
           Filter: ((name IS NOT NULL) AND (description IS NOT NULL))
           SubPlan 1
             ->  HashAggregate  (cost=11.59..12.60 rows=101 width=55)
                   ->  Append  (cost=0.00..11.34 rows=101 width=55)
                         ->  Result  (cost=0.00..0.01 rows=1 width=0)
                         ->  Index Scan using import_aliases_mid_idx on import_aliases  (cost=0.00..10.32 rows=100 width=56)
                               Index Cond: (mid = "substring"(import.mid, 5))
           SubPlan 2
             ->  HashAggregate  (cost=0.78..1.30 rows=100 width=0)
                   ->  Result  (cost=0.00..0.53 rows=100 width=0)
   CTE counts
     ->  HashAggregate  (cost=3675165.23..3675266.73 rows=20000 width=32)
           ->  CTE Scan on parsed  (cost=0.00..1869187.23 rows=361195600 width=32)
   ->  CTE Scan on parsed  (cost=0.00..162542616214.01 rows=3611956 width=96)
         SubPlan 6
           ->  Function Scan on unnest a  (cost=0.00..45001.25 rows=100 width=32)
         SubPlan 5
           ->  CTE Scan on counts  (cost=0.00..450.00 rows=100 width=8)
                 Filter: (a.token = token)

There are about 4 million rows in parsed and counts. The query is running, and the final statement is inserting roughly one row every 2 minutes. It barely touches the disk but eats CPU like crazy, and I'm confused.

What is wrong with the query?

The final statement has to look up each description element in counts, converting something like [a,b,c] into something like [(a,9),(b,4),(c,0)], and insert the result.
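For concreteness, here is a minimal, self-contained sketch of that transformation. The composite type a, the sample counts rows, and the LEFT JOIN / COALESCE formulation (in place of the correlated subquery in the query above) are illustrative assumptions, not the original schema:

    CREATE TYPE a AS (token text, count integer);
    CREATE TABLE counts (token text, count integer);
    INSERT INTO counts VALUES ('a', 9), ('b', 4);

    -- Turn ['a','b','c'] into [('a',9),('b',4),('c',0)]; WITH ORDINALITY
    -- pins the output order to the input array order.
    SELECT ARRAY(
        SELECT ROW(t.token, COALESCE(c.count, 0))::a
        FROM UNNEST(ARRAY['a','b','c']) WITH ORDINALITY AS t(token, ord)
        LEFT JOIN counts c ON c.token = t.token
        ORDER BY t.ord
    ) AS description;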


Edit

With parsed and counts materialized as real tables, and counts.token indexed, this is the plan:

explain INSERT INTO table (name, mid, description)
SELECT name, mid,
       ARRAY(
           SELECT ROW(token, count)::a
           FROM (
               SELECT token, (SELECT count FROM counts WHERE a.token = counts.token)
               FROM UNNEST(description) a(token)
           ) _
       )::a[] description
FROM parsed;

                                               QUERY PLAN
------------------------------------------------------------------------------------------------------
 Insert on table  (cost=0.00..5761751808.75 rows=4002061 width=721)
   ->  Seq Scan on parsed  (cost=0.00..5761751808.75 rows=4002061 width=721)
         SubPlan 2
           ->  Function Scan on unnest a  (cost=0.00..1439.59 rows=100 width=32)
         SubPlan 1
           ->  Index Scan using counts_token_idx on counts  (cost=0.00..14.39 rows=1 width=4)
                 Index Cond: (a.token = token)

This is much more reasonable. The arrays average 57 elements, so I suspect it was simply the huge number of lookups against the (apparently rather inefficient) CTE that was killing performance. It now runs at about 300 rows per second, which I'm quite happy with.
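A sketch of the materialization step described above, not the author's exact commands. The index name matches the one in the plan; the parsing expression behind array(...) is elided in the question and is left as a placeholder here:

    CREATE TABLE parsed AS
        SELECT name, mid, array(...) AS description  -- elided in the question
        FROM import
        WHERE name IS NOT NULL AND description IS NOT NULL;

    CREATE TABLE counts AS
        SELECT UNNEST(description) AS token, count(*) AS count
        FROM parsed
        GROUP BY 1;

    -- The index the plan above uses in SubPlan 1:
    CREATE INDEX counts_token_idx ON counts (token);
    ANALYZE counts;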

2 answers

As noted in the edit to my question, it is much faster with parsed and counts as real tables and counts.token indexed. I had assumed CTE scans were smarter than that, but a CTE is materialized and a CTE Scan cannot use an index, so each correlated lookup has to walk the whole materialized result.


So you're unnesting and rebuilding 4M arrays, right?

My guess is that you're running out of RAM, in which case you have several options. The first is to move the data between tables in stages to minimize the problem, as in the sketch below.
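One hedged reading of "moving data in stages": batch the final insert on some key so each pass touches a bounded slice. The batch_id column is an assumption added for illustration; the question's tables have no such column:

    -- Hypothetical staging: add a batchable key to the materialized table,
    -- then insert one slice at a time, committing between slices.
    ALTER TABLE parsed ADD COLUMN batch_id bigserial;

    INSERT INTO table (name, description)
    SELECT name, description  -- in practice, the full ARRAY(...) expression
    FROM parsed
    WHERE batch_id BETWEEN 1 AND 100000;
    -- ...then WHERE batch_id BETWEEN 100001 AND 200000, and so on.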

Can you tell whether it is CPU-bound or I/O-bound?


Source: https://habr.com/ru/post/1439258/

