Speed up complex Postgres queries in a Rails application

I have a view in my application that renders a lot of data, and on the backend the data is fetched with this query:

DataPoint Load (20394.8ms)
SELECT communities.id as com,
       consumers.name as con,
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
INNER JOIN "clusterings" ON "clusterings"."id" = "communities"."clustering_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
  AND "data_points"."interval_id" = $3
  AND "clusterings"."id" = 1
GROUP BY communities.id, consumers.id
[["timestamp", "2015-11-20 09:23:00"], ["timestamp", "2015-11-27 09:23:00"], ["interval_id", 2]]

The query takes about 20 seconds, which seems excessive.

The code that generates the query is as follows:

res = {}
DataPoint.joins(consumer: { communities: :clustering })
         .where('clusterings.id': self,
                timestamp: chart_cookies[:start_date]..chart_cookies[:end_date],
                interval_id: chart_cookies[:interval_id])
         .group('communities.id')
         .group('consumers.id')
         .select('communities.id as com, consumers.name as con',
                 'array_agg(timestamp ORDER BY data_points.timestamp asc) as tims',
                 'array_agg(consumption ORDER BY data_points.timestamp ASC) as cons')
         .each do |d|
  res[d.com] ||= {}
  res[d.com][d.con] = d.tims.zip(d.cons)
  res[d.com]["aggregate"] ||= d.tims.map { |t| [t, 0] }
  res[d.com]["aggregate"] = res[d.com]["aggregate"].zip(d.cons).map { |(a, b), d| [a, (b + d)] }
end
res

The corresponding tables in the database schema are:

create_table "data_points", force: :cascade do |t|
  t.bigint "consumer_id"
  t.bigint "interval_id"
  t.datetime "timestamp"
  t.float "consumption"
  t.float "flexibility"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.index ["consumer_id"], name: "index_data_points_on_consumer_id"
  t.index ["interval_id"], name: "index_data_points_on_interval_id"
  t.index ["timestamp", "consumer_id", "interval_id"], name: "index_data_points_on_timestamp_and_consumer_id_and_interval_id", unique: true
  t.index ["timestamp"], name: "index_data_points_on_timestamp"
end

create_table "consumers", force: :cascade do |t|
  t.string "name"
  t.string "location"
  t.string "edms_id"
  t.bigint "building_type_id"
  t.bigint "connection_type_id"
  t.float "location_x"
  t.float "location_y"
  t.string "feeder_id"
  t.bigint "consumer_category_id"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.index ["building_type_id"], name: "index_consumers_on_building_type_id"
  t.index ["connection_type_id"], name: "index_consumers_on_connection_type_id"
  t.index ["consumer_category_id"], name: "index_consumers_on_consumer_category_id"
end

create_table "communities_consumers", id: false, force: :cascade do |t|
  t.bigint "consumer_id", null: false
  t.bigint "community_id", null: false
  t.index ["community_id", "consumer_id"], name: "index_communities_consumers_on_community_id_and_consumer_id"
  t.index ["consumer_id", "community_id"], name: "index_communities_consumers_on_consumer_id_and_community_id"
end

create_table "communities", force: :cascade do |t|
  t.string "name"
  t.text "description"
  t.bigint "clustering_id"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
  t.index ["clustering_id"], name: "index_communities_on_clustering_id"
end

create_table "clusterings", force: :cascade do |t|
  t.string "name"
  t.text "description"
  t.datetime "created_at", null: false
  t.datetime "updated_at", null: false
end

How can I make the query faster? Can it be reorganized or simplified, or can something be added to the database schema, to reduce the execution time?

Interestingly, a slightly simplified version of the query, which I use in another view, is much faster: just 1161.4 ms for the first request and 41.6 ms for subsequent requests:

DataPoint Load (1161.4ms)
SELECT consumers.name as con,
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
  AND "data_points"."interval_id" = $3
  AND "communities"."id" = 100
GROUP BY communities.id, consumers.name
[["timestamp", "2015-11-20 09:23:00"], ["timestamp", "2015-11-27 09:23:00"], ["interval_id", 2]]

Running the query with the EXPLAIN (ANALYZE, BUFFERS) command in dbconsole, I get the following output:

GroupAggregate  (cost=12.31..7440.69 rows=246 width=57) (actual time=44.139..20474.015 rows=296 loops=1)
  Group Key: communities.id, consumers.id
  Buffers: shared hit=159692 read=6148105 written=209
  ->  Nested Loop  (cost=12.31..7434.54 rows=246 width=57) (actual time=20.944..20436.806 rows=49728 loops=1)
        Buffers: shared hit=159685 read=6148105 written=209
        ->  Nested Loop  (cost=11.88..49.30 rows=1 width=49) (actual time=0.102..6.374 rows=296 loops=1)
              Buffers: shared hit=988 read=208
              ->  Nested Loop  (cost=11.73..41.12 rows=1 width=57) (actual time=0.084..4.443 rows=296 loops=1)
                    Buffers: shared hit=396 read=208
                    ->  Merge Join  (cost=11.58..40.78 rows=1 width=24) (actual time=0.075..1.365 rows=296 loops=1)
                          Merge Cond: (communities_consumers.community_id = communities.id)
                          Buffers: shared hit=5 read=7
                          ->  Index Only Scan using index_communities_consumers_on_community_id_and_consumer_id on communities_consumers  (cost=0.27..28.71 rows=296 width=16) (actual time=0.039..0.446 rows=296 loops=1)
                                Heap Fetches: 4
                                Buffers: shared hit=1 read=6
                          ->  Sort  (cost=11.31..11.31 rows=3 width=16) (actual time=0.034..0.213 rows=247 loops=1)
                                Sort Key: communities.id
                                Sort Method: quicksort  Memory: 25kB
                                Buffers: shared hit=4 read=1
                                ->  Bitmap Heap Scan on communities  (cost=4.17..11.28 rows=3 width=16) (actual time=0.026..0.027 rows=6 loops=1)
                                      Recheck Cond: (clustering_id = 1)
                                      Heap Blocks: exact=1
                                      Buffers: shared hit=4 read=1
                                      ->  Bitmap Index Scan on index_communities_on_clustering_id  (cost=0.00..4.17 rows=3 width=0) (actual time=0.020..0.020 rows=8 loops=1)
                                            Index Cond: (clustering_id = 1)
                                            Buffers: shared hit=3 read=1
                    ->  Index Scan using consumers_pkey on consumers  (cost=0.15..0.33 rows=1 width=33) (actual time=0.007..0.008 rows=1 loops=296)
                          Index Cond: (id = communities_consumers.consumer_id)
                          Buffers: shared hit=391 read=201
              ->  Index Only Scan using clusterings_pkey on clusterings  (cost=0.15..8.17 rows=1 width=8) (actual time=0.004..0.005 rows=1 loops=296)
                    Index Cond: (id = 1)
                    Heap Fetches: 296
                    Buffers: shared hit=592
        ->  Index Scan using index_data_points_on_consumer_id on data_points  (cost=0.44..7383.44 rows=180 width=24) (actual time=56.128..68.995 rows=168 loops=296)
              Index Cond: (consumer_id = consumers.id)
              Filter: (("timestamp" >= '2015-11-20 09:23:00'::timestamp without time zone) AND ("timestamp" <= '2015-11-27 09:23:00'::timestamp without time zone) AND (interval_id = 2))
              Rows Removed by Filter: 76610
              Buffers: shared hit=158697 read=6147897 written=209
Planning time: 1.811 ms
Execution time: 20474.330 ms
(40 rows)

The bullet gem reports the following warnings:

USE eager loading detected
  Community => [:communities_consumers]
  Add to your finder: :includes => [:communities_consumers]
USE eager loading detected
  Community => [:consumers]
  Add to your finder: :includes => [:consumers]
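
For context, bullet's suggestion amounts to eager loading along these lines on whatever relation loads the communities for the view; that relation is not shown here, so the receiver and variable names below are only illustrative:

# Illustrative only: eager-load the associations bullet complains about.
# The relation that actually loads the communities in the view is not shown in the question.
communities = Community.where(clustering_id: clustering.id)
                       .includes(:communities_consumers, :consumers)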

After removing the join to the clusterings table, the new query plan is as follows:

EXPLAIN for:
SELECT communities.id as com,
       consumers.name as con,
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
  AND "data_points"."interval_id" = $3
  AND "communities"."clustering_id" = 1
GROUP BY communities.id, consumers.id
[["timestamp", "2015-11-29 20:52:30.926247"], ["timestamp", "2015-12-06 20:52:30.926468"], ["interval_id", 2]]

QUERY PLAN
--------------------------------------------------------------------------------
GroupAggregate  (cost=10839.79..10846.42 rows=241 width=57)
  ->  Sort  (cost=10839.79..10840.39 rows=241 width=57)
        Sort Key: communities.id, consumers.id
        ->  Nested Loop  (cost=7643.11..10830.26 rows=241 width=57)
              ->  Nested Loop  (cost=11.47..22.79 rows=1 width=49)
                    ->  Hash Join  (cost=11.32..17.40 rows=1 width=16)
                          Hash Cond: (communities_consumers.community_id = communities.id)
                          ->  Seq Scan on communities_consumers  (cost=0.00..4.96 rows=296 width=16)
                          ->  Hash  (cost=11.28..11.28 rows=3 width=8)
                                ->  Bitmap Heap Scan on communities  (cost=4.17..11.28 rows=3 width=8)
                                      Recheck Cond: (clustering_id = 1)
                                      ->  Bitmap Index Scan on index_communities_on_clustering_id  (cost=0.00..4.17 rows=3 width=0)
                                            Index Cond: (clustering_id = 1)
                    ->  Index Scan using consumers_pkey on consumers  (cost=0.15..5.38 rows=1 width=33)
                          Index Cond: (id = communities_consumers.consumer_id)
              ->  Bitmap Heap Scan on data_points  (cost=7631.64..10805.72 rows=174 width=24)
                    Recheck Cond: ((consumer_id = consumers.id) AND ("timestamp" >= '2015-11-29 20:52:30.926247'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926468'::timestamp without time zone))
                    Filter: (interval_id = 2::bigint)
                    ->  BitmapAnd  (cost=7631.64..7631.64 rows=861 width=0)
                          ->  Bitmap Index Scan on index_data_points_on_consumer_id  (cost=0.00..1589.92 rows=76778 width=0)
                                Index Cond: (consumer_id = consumers.id)
                          ->  Bitmap Index Scan on index_data_points_on_timestamp  (cost=0.00..6028.58 rows=254814 width=0)
                                Index Cond: (("timestamp" >= '2015-11-29 20:52:30.926247'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926468'::timestamp without time zone))
(23 rows)

As requested in the comments, this is the query plan for the slightly simplified query without the restriction on communities.id:

DataPoint Load (1563.3ms)
SELECT consumers.name as con,
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
  AND "data_points"."interval_id" = $3
GROUP BY communities.id, consumers.name
[["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]

EXPLAIN for: SELECT consumers.name as con, array_agg(timestamp ORDER BY data_points.timestamp asc) as tims, array_agg(consumption ORDER BY data_points.timestamp ASC) as cons FROM "data_points" INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id" INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id" INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id" WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND "data_points"."interval_id" = $3 GROUP BY communities.id, consumers.name [["timestamp", "2015-11-29 20:52:30.926000"], ["timestamp", "2015-12-06 20:52:30.926000"], ["interval_id", 2]]

QUERY PLAN
--------------------------------------------------------------------------------
GroupAggregate  (cost=140992.34..142405.51 rows=51388 width=49)
  ->  Sort  (cost=140992.34..141120.81 rows=51388 width=49)
        Sort Key: communities.id, consumers.name
        ->  Hash Join  (cost=10135.44..135214.45 rows=51388 width=49)
              Hash Cond: (data_points.consumer_id = consumers.id)
              ->  Bitmap Heap Scan on data_points  (cost=10082.58..134455.00 rows=51388 width=24)
                    Recheck Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
                    ->  Bitmap Index Scan on index_data_points_on_timestamp_and_consumer_id_and_interval_id  (cost=0.00..10069.74 rows=51388 width=0)
                          Index Cond: (("timestamp" >= '2015-11-29 20:52:30.926'::timestamp without time zone) AND ("timestamp" <= '2015-12-06 20:52:30.926'::timestamp without time zone) AND (interval_id = 2::bigint))
              ->  Hash  (cost=49.16..49.16 rows=296 width=49)
                    ->  Hash Join  (cost=33.06..49.16 rows=296 width=49)
                          Hash Cond: (communities_consumers.community_id = communities.id)
                          ->  Hash Join  (cost=8.66..20.69 rows=296 width=49)
                                Hash Cond: (consumers.id = communities_consumers.consumer_id)
                                ->  Seq Scan on consumers  (cost=0.00..7.96 rows=296 width=33)
                                ->  Hash  (cost=4.96..4.96 rows=296 width=16)
                                      ->  Seq Scan on communities_consumers  (cost=0.00..4.96 rows=296 width=16)
                          ->  Hash  (cost=16.40..16.40 rows=640 width=8)
                                ->  Seq Scan on communities  (cost=0.00..16.40 rows=640 width=8)
(19 rows)

+5
6 answers

Have you tried adding an index on:

"data_points"."timestamp" + "data_points"."consumer_id"

or on

"data_points"."consumer_id" alone?
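
For example, a Rails migration along these lines would add the suggested composite index (the class name and migration version are placeholders, not taken from the question):

# Sketch of the suggested composite index on data_points.
class AddTimestampAndConsumerIndexToDataPoints < ActiveRecord::Migration[5.1]
  def change
    # Covers the timestamp range filter plus the join on consumer_id.
    add_index :data_points, [:timestamp, :consumer_id],
              name: "index_data_points_on_timestamp_and_consumer_id"
  end
end

Note that the schema in the question already has a single-column index on consumer_id and a unique composite index that starts with timestamp.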

+3

What version of Postgres are you using? Postgres 10 introduced native (declarative) table partitioning. If the "data_points" table is very large, this could significantly speed up your query, since you are filtering on a time range:

 WHERE (data_points.TIMESTAMP BETWEEN $1 AND $2) 

One strategy you could explore is partitioning on the DATE of the "timestamp" column, and then modifying your query to include an additional filter so that partition pruning can kick in:

 WHERE ("data_points"."timestamp" BETWEEN $1 AND $2) AND (CAST("data_points"."timestamp" AS DATE) BETWEEN CAST($1 AS DATE) AND CAST($2 AS DATE)) AND "data_points"."interval_id" = $3 AND "data_points"."interval_id" = $3 AND "communities"."clustering_id" = 1 

If your "data_points" table is very large and the filtering range of "Timestamp" is small, this should help, because it quickly filters out blocks of rows that do not need to be processed.

I have not done this in Postgres myself, so I am not sure how much it will help, but it is something worth looking into :)

https://www.postgresql.org/docs/10/static/ddl-partitioning.html#DDL-PARTITIONING-DECLARATIVE

+3

Do you have a foreign key on clustering_id? Also, try changing your query as follows:

SELECT communities.id as com,
       consumers.name as con,
       array_agg(timestamp ORDER BY data_points.timestamp asc) as tims,
       array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
FROM "data_points"
INNER JOIN "consumers" ON "consumers"."id" = "data_points"."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)
  AND "data_points"."interval_id" = $3
  AND "communities"."clustering_id" = 1
GROUP BY communities.id, consumers.id
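
If the constraint is missing, a migration roughly like this would add it (the schema above already has index_communities_on_clustering_id, so only the foreign key itself is added; the class name is a placeholder):

# Sketch: add a foreign-key constraint from communities.clustering_id to clusterings.
class AddClusteringForeignKeyToCommunities < ActiveRecord::Migration[5.1]
  def change
    add_foreign_key :communities, :clusterings, column: :clustering_id
  end
end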
+2
  • You do not need the join to clusterings. Try removing it from your query and filtering on communities.clustering_id = 1 instead; that removes three steps from your query plan. This should give you the biggest saving, since the current plan performs several index scans for that join inside three nested loops (see the sketch after this list).

  • You can also try truncating the timestamps you aggregate on; I assume you do not need them at one-second resolution?

  • I would also drop the "index_data_points_on_timestamp" index, since you already have a composite index whose leading column is timestamp, which makes the single-column index practically redundant. Dropping it should improve your write performance.
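
A rough Rails sketch of the first and third points, assuming, as in the original code, that the query is built on a Clustering instance (so id below is that clustering's id):

# Sketch: drop the clusterings join and filter on communities.clustering_id instead.
DataPoint.joins(consumer: :communities)
         .where('communities.clustering_id': id,  # the current clustering's id (self in the original code)
                timestamp: chart_cookies[:start_date]..chart_cookies[:end_date],
                interval_id: chart_cookies[:interval_id])
         .group('communities.id', 'consumers.id')
         .select('communities.id as com, consumers.name as con',
                 'array_agg(timestamp ORDER BY data_points.timestamp ASC) as tims',
                 'array_agg(consumption ORDER BY data_points.timestamp ASC) as cons')

# Sketch: drop the redundant single-column timestamp index in a migration.
class RemoveTimestampIndexFromDataPoints < ActiveRecord::Migration[5.1]
  def change
    remove_index :data_points, name: "index_data_points_on_timestamp"
  end
end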

+2

The index on data_points.timestamp is not being used, possibly because of the ::timestamp conversion.

I wonder whether changing the column's data type, or creating a functional (expression) index, would help.

EDIT: the datetime in your schema is, I think, just how Rails displays the Postgres timestamp data type, so there may be no conversion happening after all.

Still, the index on timestamp is not being used; depending on your data distribution, though, that may be a perfectly reasonable choice by the optimizer.

0

So, we have Postgres 9.3 and a slow query. Before touching the query itself, make sure your database settings are optimal and suited to your read/write ratio and your disk (an SSD or an old spinning disk), that autovacuum is not switched off, that you check table and index bloat, and that the indexes used to build the plans have good selectivity.

Check your column types and how much space the rows take up; changing a column type can reduce the table size and the query time.

Assuming all of that is in order, let's think about how Postgres executes the query and how we can reduce the work. An ORM is fine for simple queries, but for a complex query like this one it is better to drop down to raw SQL and introduce Query Service Objects.
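
For illustration, a minimal query object wrapping the question's SQL behind a single method could look roughly like this (the class name and interface are invented for this example):

# Hypothetical query object: hides the hand-written SQL behind one method.
class CommunityConsumptionQuery
  def initialize(clustering_id:, start_date:, end_date:, interval_id:)
    @clustering_id = clustering_id
    @start_date = start_date
    @end_date = end_date
    @interval_id = interval_id
  end

  # Returns DataPoint instances carrying the extra com/con/tims/cons attributes.
  def rows
    DataPoint.find_by_sql([<<~SQL, @clustering_id, @start_date, @end_date, @interval_id])
      SELECT communities.id AS com,
             consumers.name AS con,
             array_agg(timestamp ORDER BY data_points.timestamp ASC) AS tims,
             array_agg(consumption ORDER BY data_points.timestamp ASC) AS cons
      FROM data_points
      INNER JOIN consumers ON consumers.id = data_points.consumer_id
      INNER JOIN communities_consumers ON communities_consumers.consumer_id = consumers.id
      INNER JOIN communities ON communities.id = communities_consumers.community_id
      INNER JOIN clusterings ON clusterings.id = communities.clustering_id
      WHERE clusterings.id = ?
        AND data_points."timestamp" BETWEEN ? AND ?
        AND data_points.interval_id = ?
      GROUP BY communities.id, consumers.id
    SQL
  end
end

# Usage (local variable names are illustrative):
#   CommunityConsumptionQuery.new(clustering_id: clustering.id,
#                                 start_date: chart_cookies[:start_date],
#                                 end_date: chart_cookies[:end_date],
#                                 interval_id: chart_cookies[:interval_id]).rows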

Write simpler SQL queries; Postgres also spends time parsing and planning complex ones.

Check that all join columns are indexed, and use EXPLAIN ANALYZE to verify that the best scan methods are being used.

Next point: you are doing 4 joins! Postgres has to search for the optimal query plan among roughly 4! (four factorial) possible join orders, so consider using subqueries, or a table with predefined data, for part of this selection.

1) Use a separate query or function for the 4 joins (try subqueries):

SELECT *
FROM "data_points" AS predefined
INNER JOIN "consumers" ON "consumers"."id" = predefined."consumer_id"
INNER JOIN "communities_consumers" ON "communities_consumers"."consumer_id" = "consumers"."id"
INNER JOIN "communities" ON "communities"."id" = "communities_consumers"."community_id"
INNER JOIN "clusterings" ON "clusterings"."id" = "communities"."clustering_id"
WHERE predefined."interval_id" = 2
  AND "clusterings"."id" = 1

2) Then filter that predefined result on the timestamp range (literal values are used here instead of bind variables, just for illustration):

SELECT *
FROM predefined
WHERE predefined."timestamp" BETWEEN '2015-11-20 09:23:00' AND '2015-11-27 09:23:00'

3) You reference data_points in three places in the query; try to reduce how often you need it:

array_agg(timestamp ORDER BY data_points.timestamp asc) as tims
array_agg(consumption ORDER BY data_points.timestamp ASC) as cons
WHERE ("data_points"."timestamp" BETWEEN $1 AND $2)

With a long-running query, remember that it is not only about the query itself: it is also about the settings, how you use the ORM and SQL, and how Postgres works with all of this.

0

Source: https://habr.com/ru/post/1273681/

