The following query processes about 4 million rows. The first two CTEs complete in about an hour. The final statement, however, has been running for over 15 hours.
WITH parsed AS (
    SELECT name, array(...) description
    FROM import
), counts AS (
    SELECT unnest(description) token, count(*)
    FROM parsed
    GROUP BY 1
)
INSERT INTO table (name, description)
SELECT name,
       ARRAY(
           SELECT ROW(token, count)::a
           FROM (
               SELECT token,
                      (SELECT count FROM counts WHERE a.token = counts.token)
               FROM UNNEST(description) a(token)
           ) _
       )::a[] description
FROM parsed;

QUERY PLAN
There are about 4 million rows in both parsed and counts. While the query runs, the final statement inserts roughly one row every 2 minutes. It barely touches the disk but eats CPU like crazy, and I'm confused.
What is wrong with the query?
The final statement must look up each description element in counts, converting something like [a,b,c] into something like [(a,9),(b,4),(c,0)], and insert the result.
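To make the intent concrete, here is a minimal sketch of what the final statement computes, with a join standing in for the correlated subquery. The composite type a is an assumption (the question only shows the ::a cast), and the sample data is made up:

CREATE TYPE a AS (token text, count bigint);

WITH parsed(name, description) AS (
    VALUES ('doc1', ARRAY['x','y']),
           ('doc2', ARRAY['x'])
), counts AS (
    -- count how often each token appears across all descriptions
    SELECT unnest(description) AS token, count(*) AS count
    FROM parsed
    GROUP BY 1
)
SELECT name,
       ARRAY(
           -- pair every token of this row with its global count
           SELECT ROW(t.token, c.count)::a
           FROM unnest(description) AS t(token)
           JOIN counts c USING (token)
       ) AS description
FROM parsed;
-- doc1 -> {"(x,2)","(y,1)"}, doc2 -> {"(x,2)"}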
Edit
After materializing parsed and counts as real tables, running ANALYZE, and indexing token in counts, this is the plan:
explain INSERT INTO table (name, mid, description)
SELECT name, mid,
       ARRAY(
           SELECT ROW(token, count)::a
           FROM (
               SELECT token,
                      (SELECT count FROM counts WHERE a.token = counts.token)
               FROM UNNEST(description) a(token)
           ) _
       )::a[] description
FROM parsed;

QUERY PLAN
Which is much more reasonable. The arrays average 57 elements each, so I think it was simply the huge number of lookups against the apparently rather inefficient materialized CTE that was killing performance: a materialized CTE has no indexes, so each correlated lookup scans it in full. Now it runs at about 300 rows per second, which I'm quite happy with.
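For reference, a sketch of that restructuring, assuming parsed has already been materialized as a real table; the table and column names follow the question:

-- Materialize the token counts as a real, indexable table so that each
-- lookup WHERE a.token = counts.token can use an index scan instead of
-- rescanning a materialized CTE.
CREATE TABLE counts AS
    SELECT unnest(description) AS token, count(*) AS count
    FROM parsed
    GROUP BY 1;

CREATE INDEX ON counts (token);
ANALYZE counts;  -- refresh statistics so the planner chooses the index

With counts indexed, each of the ~57 per-row lookups becomes an index scan rather than a full pass over roughly 4 million rows.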