Aggregate query in 50M + row table in PostgreSQL

Problem Statement

I have a table "event_statistics" with the following definition:

CREATE TABLE public.event_statistics (
    id int4 NOT NULL DEFAULT nextval('event_statistics_id_seq'::regclass),
    client_id int4 NULL,
    session_id int4 NULL,
    action_name text NULL,
    value text NULL,
    product_id int8 NULL,
    product_options jsonb NOT NULL DEFAULT '{}'::jsonb,
    url text NULL,
    url_options jsonb NOT NULL DEFAULT '{}'::jsonb,
    visit int4 NULL DEFAULT 0,
    date_update timestamptz NULL,
CONSTRAINT event_statistics_pkey PRIMARY KEY (id),
CONSTRAINT event_statistics_client_id_session_id_sessions_client_id_id_for 
FOREIGN KEY 
(client_id,session_id) REFERENCES <?>() ON DELETE CASCADE ON UPDATE CASCADE
)
WITH (
    OIDS=FALSE
) ;
CREATE INDEX regdate ON public.event_statistics (date_update 
timestamptz_ops) ;

And the clients table:

CREATE TABLE public.clients (
    id int4 NOT NULL DEFAULT nextval('clients_id_seq'::regclass),
    client_name text NULL,
    client_hash text NULL,
CONSTRAINT clients_pkey PRIMARY KEY (id)
)
WITH (
    OIDS=FALSE
) ;
CREATE INDEX clients_client_name_idx ON public.clients (client_name 
text_ops) ;

I need to get the number of events in the "event_statistics" table of each type "action_name" for a specific grouping of the ranges "date_update" with "action_name" and a specific time step and all this for a specific client.

The goal is to provide statistics for all relevant events for each client on their information panel on our website with the ability to select report dates and depending on the interval of the time interval in the chart should be different:

  • current day - counting for each hour;
  • 1+ day and <= 1 month - calculation for each day;
  • 1+ <= 6 - ;
  • 6+ - .

:

SELECT t.date, A.actionName, count(E.id)
FROM generate_series(current_date - interval '1 week',now(),interval '1 
day') as t(date) cross join
(values
('page_open'),
('product_add'),
('product_buy'),
('product_event'),
('product_favourite'),
('product_open'),
('product_share'),
('session_start')) as A(actionName) left join
(select action_name,date_trunc('day',e.date_update) as dateTime, e.id 
from event_statistics as e 
where e.client_id = (select id from clients as c where c.client_name = 
'client name') and 
(date_update between (current_date - interval '1 week') and now())) E 
on t.date = E.dateTime and A.actionName = E.action_name
group by A.actionName,t.date
order by A.actionName,t.date;

, 10 . , , , , ( , , , ).

:

GroupAggregate  (cost=171937.16..188106.84 rows=1600 width=44)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on clients c  (cost=0.00..1.07 rows=1 width=4)
          Filter: (client_name = 'client name'::text)
  ->  Merge Left Join  (cost=171936.08..183784.31 rows=574060 width=44)
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date =(date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40)
              Sort Key: "*VALUES*".column1, t.date
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40)
                    ->  Function Scan on generate_series t (cost=0.02..10.02 rows=1000 width=8)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32)
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32)
        ->  Materialize  (cost=171307.32..171881.38 rows=114812 width=24)
              ->  Sort  (cost=171307.32..171594.35 rows=114812 width=24)
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    ->  Index Scan using regdate on event_statistics e (cost=0.57..159302.49 rows=114812 width=24)
                          Index Cond: ((date_update > (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= now()))
                          Filter: (client_id = $0)

"event_statistics" 50 , , .

, . , stackoverflow , , :

  • client_id
  • , , ( , ) -)
  • ( Intel Xeon E7-4850 2.00GHz, 6 -, )
  • OLAP, Postgres-XL - ?

btree event_statistics (client_id asc, action_name asc, date_update asc, id). , , .

?

Update

explain (analyze, verbose):

GroupAggregate  (cost=860934.44..969228.46 rows=1600 width=44) (actual time=52388.678..54671.187 rows=64 loops=1)
  Output: t.date, "*VALUES*".column1, count(e.id)
  Group Key: "*VALUES*".column1, t.date
  InitPlan 1 (returns $0)
    ->  Seq Scan on public.clients c  (cost=0.00..1.07 rows=1 width=4) (actual time=0.058..0.059 rows=1 loops=1)
          Output: c.id
          Filter: (c.client_name = 'client name'::text)
          Rows Removed by Filter: 5
  ->  Merge Left Join  (cost=860933.36..940229.77 rows=3864215 width=44) (actual time=52388.649..54388.698 rows=799737 loops=1)
        Output: t.date, "*VALUES*".column1, e.id
        Merge Cond: (("*VALUES*".column1 = e.action_name) AND (t.date = (date_trunc('day'::text, e.date_update))))
        ->  Sort  (cost=628.77..648.77 rows=8000 width=40) (actual time=0.190..0.244 rows=64 loops=1)
              Output: t.date, "*VALUES*".column1
              Sort Key: "*VALUES*".column1, t.date
              Sort Method: quicksort  Memory: 30kB
              ->  Nested Loop  (cost=0.02..110.14 rows=8000 width=40) (actual time=0.059..0.080 rows=64 loops=1)
                    Output: t.date, "*VALUES*".column1
                    ->  Function Scan on pg_catalog.generate_series t  (cost=0.02..10.02 rows=1000 width=8) (actual time=0.043..0.043 rows=8 loops=1)
                          Output: t.date
                          Function Call: generate_series(((('now'::cstring)::date - '7 days'::interval))::timestamp with time zone, now(), '1 day'::interval)
                    ->  Materialize  (cost=0.00..0.14 rows=8 width=32) (actual time=0.002..0.003 rows=8 loops=8)
                          Output: "*VALUES*".column1
                          ->  Values Scan on "*VALUES*"  (cost=0.00..0.10 rows=8 width=32) (actual time=0.004..0.005 rows=8 loops=1)
                                Output: "*VALUES*".column1
        ->  Materialize  (cost=860304.60..864168.81 rows=772843 width=24) (actual time=52388.441..54053.748 rows=799720 loops=1)
              Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update))
              ->  Sort  (cost=860304.60..862236.70 rows=772843 width=24) (actual time=52388.432..53703.531 rows=799720 loops=1)
                    Output: e.id, e.date_update, e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Key: e.action_name, (date_trunc('day'::text, e.date_update))
                    Sort Method: external merge  Disk: 39080kB
                    ->  Index Scan using regdate on public.event_statistics e  (cost=0.57..753018.26 rows=772843 width=24) (actual time=31.423..44284.363 rows=799720 loops=1)
                          Output: e.id, e.date_update, e.action_name, date_trunc('day'::text, e.date_update)
                          Index Cond: ((e.date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (e.date_update <= now()))
                          Filter: (e.client_id = $0)
                          Rows Removed by Filter: 2983424
Planning time: 7.278 ms
Execution time: 54708.041 ms
+4
1

: :


EXPLAIN
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1 
day') as cal(theday) -- calendar pseudo-table
CROSS JOIN (VALUES
        ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) AS act(action_name)
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) as theday
                , COUNT(DISTINCT es.id ) AS the_count
        FROM event_statistics as es
        WHERE es.client_id = (SELECT c.id FROM clients AS c
                        WHERE c.client_name = 'client name')
        AND (es.date_update BETWEEN (current_date - interval '1 week') AND now())
        GROUP BY 1,2
        ) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name,cal.theday
ORDER BY act.action_name,cal.theday
        ;

: VALUES CTE . ( , )


EXPLAIN
WITH act(action_name) AS (VALUES
        ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        )
SELECT cal.theday, act.action_name, SUM(sub.the_count)
FROM generate_series(current_date - interval '1 week', now(), interval '1day') AS cal(theday)
CROSS JOIN act
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) AS theday
                , COUNT(DISTINCT es.id ) AS the_count
        FROM event_statistics AS es
        WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
        AND EXISTS (SELECT * FROM clients cli  WHERE cli.id= es.client_id AND cli.client_name = 'client name')
        AND EXISTS (SELECT * FROM act WHERE act.action_name = es.action_name)
        GROUP BY 1,2
        ) sub ON cal.theday = sub.theday AND act.action_name = sub.action_name
GROUP BY act.action_name,cal.theday
ORDER BY act.action_name,cal.theday
        ;

UPDATE: fysical (temp) .


    -- Final attempt: materialize the carthesian product (timeseries*action_name)
    -- into a temp table
CREATE TEMP TABLE grid AS
(SELECT act.action_name, cal.theday
FROM generate_series(current_date - interval '1 week', now(), interval '1 day')
    AS cal(theday)
CROSS JOIN
    (VALUES ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) act(action_name)
    );
CREATE UNIQUE INDEX ON grid(action_name, theday);

    -- Index will force statistics to be collected
    -- ,and will generate better estimates for the numbers of rows
CREATE INDEX iii ON event_statistics (action_name, date_update ) ;
VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT grid.action_name, grid.theday, SUM(sub.the_count) AS the_count
FROM grid
LEFT JOIN (
        SELECT es.action_name, date_trunc('day',es.date_update) AS theday
                , COUNT(*) AS the_count
        FROM event_statistics AS es
        WHERE es.date_update BETWEEN (current_date - interval '1 week') AND now()
        AND EXISTS (SELECT * FROM clients cli  WHERE cli.id= es.client_id AND cli.client_name = 'client name')
        -- AND EXISTS (SELECT * FROM grid WHERE grid.action_name = es.action_name)
        GROUP BY 1,2
        ORDER BY 1,2 --nonsense!
        ) sub ON grid.theday = sub.theday AND grid.action_name = sub.action_name
GROUP BY grid.action_name,grid.theday
ORDER BY grid.action_name,grid.theday
        ;

№ 3 (, (-) , . onthetimestamp)


    -- attempt#4:
    -- - materialize the carthesian product (timeseries*action_name)
    -- - sanitize date interval -logic

CREATE TEMP TABLE grid AS
(SELECT act.action_name, cal.theday::date
FROM generate_series(current_date - interval '1 week', now(), interval '1 day')
    AS cal(theday)
CROSS JOIN
    (VALUES ('page_open')
        , ('product_add') , ('product_buy') , ('product_event')
        , ('product_favourite') , ('product_open') , ('product_share') , ('session_start')
        ) act(action_name)
    );

    -- Index will force statistics to be collected
    -- ,and will generate better estimates for the numbers of rows
-- CREATE UNIQUE INDEX ON grid(action_name, theday);
-- CREATE INDEX iii ON event_statistics (action_name, date_update ) ;
CREATE UNIQUE INDEX ON grid(theday, action_name);
CREATE INDEX iii ON event_statistics (date_update, action_name) ;
VACUUM ANALYZE grid;
VACUUM ANALYZE event_statistics;

EXPLAIN
SELECT gr.action_name, gr.theday
            , COUNT(*) AS the_count
FROM grid gr
LEFT JOIN event_statistics AS es
    ON es.action_name = gr.action_name
    AND date_trunc('day',es.date_update)::date = gr.theday
    AND es.date_update BETWEEN (current_date - interval '1 week') AND current_date
JOIN clients cli  ON cli.id= es.client_id AND cli.client_name = 'client name'
GROUP BY gr.action_name,gr.theday
ORDER BY 1,2
        ;

                                                                        QUERY PLAN                                                                        
----------------------------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=8.33..8.35 rows=1 width=17)
   Group Key: gr.action_name, gr.theday
   ->  Sort  (cost=8.33..8.34 rows=1 width=17)
         Sort Key: gr.action_name, gr.theday
         ->  Nested Loop  (cost=1.40..8.33 rows=1 width=17)
               ->  Nested Loop  (cost=1.31..7.78 rows=1 width=40)
                     Join Filter: (es.client_id = cli.id)
                     ->  Index Scan using clients_client_name_key on clients cli  (cost=0.09..2.30 rows=1 width=4)
                           Index Cond: (client_name = 'client name'::text)
                     ->  Bitmap Heap Scan on event_statistics es  (cost=1.22..5.45 rows=5 width=44)
                           Recheck Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
                           ->  Bitmap Index Scan on iii  (cost=0.00..1.22 rows=5 width=0)
                                 Index Cond: ((date_update >= (('now'::cstring)::date - '7 days'::interval)) AND (date_update <= ('now'::cstring)::date))
               ->  Index Only Scan using grid_theday_action_name_idx on grid gr  (cost=0.09..0.54 rows=1 width=17)
                     Index Cond: ((theday = (date_trunc('day'::text, es.date_update))::date) AND (action_name = es.action_name))
(15 rows)
0

Source: https://habr.com/ru/post/1684568/


All Articles