How to speed up a slow multiprocessor request in a small Vertica database (~ 120 thousand lines, 10 minutes)

I would really like your help to understand why this query with the combined load takes about 10 minutes to work in a small database of seven tables, 120K, and ideally get your suggestions on how to do this faster on our small cluster from four knots. Here I post the supporting information: https://gist.github.com/anonymous/8862796 (list of tables, list of fields by table and table sizes), but the next request and output is EXPLAIN VERBOSE. I executed ANALYZE_WORKLOAD () in this query, and then executed his suggestion to run ANALYZE_STATISTICS in all tables. This did not lead to improvement. Then I made a second proposal to start the database designer, which led to even lower performance. I am very grateful to you for your help.

Profile Information

Thanks to the tips below re: PROFILE. I ran it and set the results here: https://gist.github.com/anonymous/8935190 . It is 8K lines long, so maybe I did not start it correctly (the details are in essence). Question: How do I start analyzing it?

Request History

The reason the query is messy is mainly due to the fact that it is dynamically generated for each launch of our machine learning research software, which must go through a style chart through ER tables, applying various conditions. In this case, the way [bets, films, bets, ml_user, bets, films, bets]. Requests are gradually created during the study of the solution-space program, therefore (at present) they are not optimizers, which are kindly and correctly suggested below: @wumpz and @Bohemian, for example, excluding subqueries. This means that I’m kind of "stuck with its current form for a short time: - /

 ------------------------------ 
 QUERY PLAN DESCRIPTION: 
 ------------------------------

 Opt Vertica Options
 --------------------
 PLAN_OUTPUT_SUPER_VERBOSE


 EXPLAIN VERBOSE
 SELECT relVarTable0.id AS id, relVarTable1.val, relVarTable2.val 
 FROM (SELECT id FROM rates) relVarTable0
 LEFT JOIN
 (SELECT rates1.id AS id, AVG(rates4.rating) AS val 
 FROM rates rates1, movie movie1, rates rates2, ml_user ml_user1, rates rates3, movie movie2, rates rates4 
 WHERE movie1.id = rates1.movie_id AND movie1.id = rates2.movie_id AND ml_user1.id = rates2.ml_user_id AND ml_user1.id = rates3.ml_user_id AND movie2.id = rates3.movie_id AND movie2.id = rates4.movie_id AND movie1.id <> movie2.id AND rates1.id <> rates2.id AND rates2.id <> rates3.id AND rates3.id <> rates4.id AND rates4.rating IS NOT NULL 
 GROUP BY rates1.id) relVarTable1
  ON relVarTable0.id = relVarTable1.id
 LEFT JOIN
 (SELECT rates1.id AS id, rates1.rating AS val 
 FROM rates rates1 
 WHERE rates1.rating IS NOT NULL ) relVarTable2
  ON relVarTable0.id = relVarTable2.id;

 Access Path:
 Sort Key: (V(1,1))
  LDISTRIB_UNSEGMENTED
 +-JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4489.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5441368.000000 Memory(B): 1209184.000000 Netwrk(B): 1209184.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 1) Inner (RESEGMENT)
 |  Join Cond: (relVarTable0.id = relVarTable2.id)
 |  Execute on: All Nodes
 |  Sort Key: (V(1,1))
 |  LDISTRIB_UNSEGMENTED
 | +-- Outer -> JOIN MERGEJOIN(inputs presorted) [LeftOuter] [Cost: 4197.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 1369200.000000 Memory(B): 0.000000 Netwrk(B): 604600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 2) Outer (RESEGMENT)
 | |      Join Cond: (relVarTable0.id = relVarTable1.id)
 | |      Execute on: All Nodes
 | |      Sort Key: (V(1,1))
 | |      LDISTRIB_UNSEGMENTED
 | | +-- Outer -> SELECT [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 3)
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | |      LDISTRIB_UNSEGMENTED
 | | | +---> STORAGE ACCESS for rates [Cost: 20.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 4)
 | | | |      Column Cost Aspects: [ Disk(B): 196608.000000 CPU(B): 0.000000 Memory(B): 604600.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | |      Projection: movielens_test.rates_b0
 | | | |      Materialize: rates.id
 | | | |      Execute on: All Nodes
 | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | |      LDISTRIB_SEGMENTED
 | | +-- Inner -> SELECT [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 5)
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id)
 | | |      LDISTRIB_UNSEGMENTED
 | | | +---> GROUPBY HASH (SORT OUTPUT) (GLOBAL RESEGMENT GROUPS) (LOCAL RESEGMENT GROUPS) [Cost: 4067.000000, Rows: 10000.000000 Disk(B): 0.000000 CPU(B): 6650600.000000 Memory(B): 640000.000000 Netwrk(B): 6890600.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 6)
 | | | |      Aggregates: sum_float(<SVAR>), count(<SVAR>)
 | | | |      Group By: rates1.id
 | | | |      Execute on: All Nodes
 | | | |      Sort Key: (rates.id)
 | | | |      LDISTRIB_SEGMENTED
 | | | | +---> JOIN HASH [Cost: 2869.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 12091944.000000 Memory(B): 3022960.000000 Netwrk(B): 1813776.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 88] (PATH ID: 7) Inner (RESEGMENT)
 | | | | |      Join Cond: (movie2.id = rates4.movie_id)
 | | | | |      Join Filter: (rates3.id <> rates4.id)
 | | | | |      Execute on: All Nodes
 | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | |      LDISTRIB_UNSEGMENTED
 | | | | | +-- Outer -> JOIN HASH [Cost: 2395.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 9110592.000000 Memory(B): 41592.000000 Netwrk(B): 4246064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 64] (PATH ID: 8) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | |      Join Cond: (movie2.id = rates3.movie_id)
 | | | | | |      Join Filter: (movie1.id <> movie2.id)
 | | | | | |      Execute on: All Nodes
 | | | | | |      Runtime Filter: (SIP1(HashJoin): movie2.id)
 | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | +-- Outer -> JOIN HASH [Cost: 1625.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 10278200.000000 Memory(B): 3023000.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 56] (PATH ID: 9) Inner (RESEGMENT)
 | | | | | | |      Join Cond: (ml_user1.id = rates3.ml_user_id)
 | | | | | | |      Join Filter: (rates2.id <> rates3.id)
 | | | | | | |      Execute on: All Nodes
 | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | |      LDISTRIB_UNSEGMENTED
 | | | | | | | +-- Outer -> JOIN HASH [Cost: 1163.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 5582544.000000 Memory(B): 141144.000000 Netwrk(B): 2465448.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 40] (PATH ID: 10) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | | | |      Join Cond: (ml_user1.id = rates2.ml_user_id)
 | | | | | | | |      Execute on: All Nodes
 | | | | | | | |      Runtime Filter: (SIP2(HashJoin): ml_user1.id)
 | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | +-- Outer -> JOIN HASH [Cost: 711.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 8464400.000000 Memory(B): 2418400.000000 Netwrk(B): 1813800.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 32] (PATH ID: 11) Outer (RESEGMENT)(LOCAL ROUND ROBIN)
 | | | | | | | | |      Join Cond: (movie1.id = rates2.movie_id)
 | | | | | | | | |      Join Filter: (rates1.id <> rates2.id)
 | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates2 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 12)
 | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | | | |      Materialize: rates2.id, rates2.ml_user_id, rates2.movie_id
 | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | +-- Inner -> JOIN HASH [Cost: 268.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 3064592.000000 Memory(B): 41592.000000 Netwrk(B): 1223064.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 13) Outer (RESEGMENT)(LOCAL ROUND ROBIN) Inner (RESEGMENT)
 | | | | | | | | | |      Join Cond: (movie1.id = rates1.movie_id)
 | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | | +-- Outer -> STORAGE ACCESS for rates1 [Cost: 39.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 14)
 | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 0.000000 Memory(B): 1209200.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | | | | |      Materialize: rates1.id, rates1.movie_id
 | | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | | | +-- Inner -> STORAGE ACCESS for movie1 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 15)
 | | | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | | | |      Projection: movielens_test.movie_b0
 | | | | | | | | | | |      Materialize: movie1.id
 | | | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
 | | | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | | +-- Inner -> STORAGE ACCESS for ml_user1 [Cost: 5.000000, Rows: 5881.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 16)
 | | | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 47048.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | | |      Projection: movielens_test.ml_user_b0
 | | | | | | | | |      Materialize: ml_user1.id
 | | | | | | | | |      Execute on: All Nodes
 | | | | | | | | |      Sort Key: (ml_user.id, ml_user.gender, ml_user.age_range, ml_user.occupation, ml_user.zipcode, ml_user.num_ratings)
 | | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | | +-- Inner -> STORAGE ACCESS for rates3 [Cost: 59.000000, Rows: 75575.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PATH ID: 17)
 | | | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 0.000000 Memory(B): 1813800.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | | |      Projection: movielens_test.rates_b0
 | | | | | | | |      Materialize: rates3.id, rates3.ml_user_id, rates3.movie_id
 | | | | | | | |      Execute on: All Nodes
 | | | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | | +-- Inner -> STORAGE ACCESS for movie2 [Cost: 5.000000, Rows: 1733.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 8] (PATH ID: 18)
 | | | | | | |      Column Cost Aspects: [ Disk(B): 65536.000000 CPU(B): 0.000000 Memory(B): 13864.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | | |      Projection: movielens_test.movie_b0
 | | | | | | |      Materialize: movie2.id
 | | | | | | |      Execute on: All Nodes
 | | | | | | |      Sort Key: (movie.id, movie.title, movie.year, movie.imdb_id, movie.rotten_tomatoes_id, movie.rotten_tomatoes_critic_score, movie.rotten_tomatoes_audience_score, movie.budget, movie.gross, movie.mpaa_rating, movie.runtime, movie.action, movie.adventure, movie.animation, movie.childrens, movie.comedy, movie.crime, movie.documentary, movie.drama, movie.fantasy, movie.film_noir, movie.horror, movie.musical, movie.mystery, movie.romance, movie.sci_fi, movie.thriller, movie.war, movie.western, movie.is_usa, movie.num_actors, movie.num_ratings)
 | | | | | | |      LDISTRIB_SEGMENTED
 | | | | | +-- Inner -> STORAGE ACCESS for rates4 [Cost: 60.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 24] (PUSHED GROUPING) Partial GroupBy: rates4.movie_id,rates4.id Partial Aggs: sum_float(<SVAR>),count(<SVAR>) (PATH ID: 19)
 | | | | | |      Column Cost Aspects: [ Disk(B): 589824.000000 CPU(B): 196608.000000 Memory(B): 1813784.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | | | | |      Projection: movielens_test.rates_b0
 | | | | | |      Materialize: rates4.rating, rates4.id, rates4.movie_id
 | | | | | |      Filter: (rates4.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
 | | | | | |      Execute on: All Nodes
 | | | | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | | | | |      LDISTRIB_SEGMENTED
 | +-- Inner -> SELECT [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 1.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 20)
 | |      Execute on: All Nodes
 | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | |      LDISTRIB_UNSEGMENTED
 | | +---> STORAGE ACCESS for rates1 [Cost: 41.000000, Rows: 75574.000000 Disk(B): 0.000000 CPU(B): 0.000000 Memory(B): 0.000000 Netwrk(B): 0.000000 Parallelism: 4.000000 (NO STATISTICS)] [OutRowSz (B): 16] (PATH ID: 21)
 | | |      Column Cost Aspects: [ Disk(B): 393216.000000 CPU(B): 196608.000000 Memory(B): 1209184.000212 Netwrk(B): 0.000000 Parallelism: 4.000000 ]
 | | |      Projection: movielens_test.rates_b0
 | | |      Materialize: rates1.rating, rates1.id
 | | |      Filter: (rates1.rating IS NOT NULL)/* sel=0.999974 ndv= 500 */
 | | |      Execute on: All Nodes
 | | |      Sort Key: (rates.id, rates.ml_user_id, rates.movie_id, rates.rating)
 | | |      LDISTRIB_SEGMENTED


 ------------------------------ 
+4
source share
2 answers

NO STATISTICS . , .

. ? -, tbl . , - ( ), ( ).

  • DBD
  • ,
  • , movielens_test.rates
    • o node (MPP),
    • , , DDL, cols ( , DBD - )

, :

. , , , , .

- . , .

+7

: rates3.id, rates3.ml_user_id, rates3.movie_id - , .

,

: (SIP2 (HashJoin): ml_user1.id). claus , ( projection_usage)

, , ( ).

RESGMENTED_MANY_ROWS. , /WHERE CLAUSE, . e.empid() = f.empid() , float. ( ) * query_profiles transaction_id < > statement_id;

+1

Source: https://habr.com/ru/post/1526217/


All Articles