I have 3 tables that I want to combine using inner joins in Postgres 9.1: reads, devices, and patient_devices. Below is an abridged schema for each table.
reads - ~ 250,000 rows
CREATE TABLE reads (
  id serial NOT NULL,
  device_id integer NOT NULL,
  value bigint NOT NULL,
  read_datetime timestamp without time zone NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT reads_pkey PRIMARY KEY (id)
) WITH (OIDS=FALSE);
ALTER TABLE reads OWNER TO postgres;
CREATE INDEX index_reads_on_device_id ON reads USING btree (device_id);
CREATE INDEX index_reads_on_read_datetime ON reads USING btree (read_datetime);
devices - ~ 500 rows
CREATE TABLE devices (
  id serial NOT NULL,
  serial_number character varying(20) NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT devices_pkey PRIMARY KEY (id)
) WITH (OIDS=FALSE);
ALTER TABLE devices OWNER TO postgres;
CREATE UNIQUE INDEX index_devices_on_serial_number ON devices USING btree (serial_number COLLATE pg_catalog."default");
patient_devices - ~ 25,000 rows
CREATE TABLE patient_devices (
  id serial NOT NULL,
  patient_id integer NOT NULL,
  device_id integer NOT NULL,
  issuance_datetime timestamp without time zone NOT NULL,
  unassignment_datetime timestamp without time zone,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT patient_devices_pkey PRIMARY KEY (id)
) WITH (OIDS=FALSE);
ALTER TABLE patient_devices OWNER TO postgres;
CREATE INDEX index_patient_devices_on_device_id ON patient_devices USING btree (device_id);
CREATE INDEX index_patient_devices_on_issuance_datetime ON patient_devices USING btree (issuance_datetime);
CREATE INDEX index_patient_devices_on_patient_id ON patient_devices USING btree (patient_id);
CREATE INDEX index_patient_devices_on_unassignment_datetime ON patient_devices USING btree (unassignment_datetime);
patients - ~ 1000 rows
CREATE TABLE patients (
  id serial NOT NULL,
  first_name character varying(50) NOT NULL,
  middle_name character varying(50),
  last_name character varying(50) NOT NULL,
  created_at timestamp without time zone NOT NULL,
  updated_at timestamp without time zone NOT NULL,
  CONSTRAINT participants_pkey PRIMARY KEY (id)
) WITH (OIDS=FALSE);
ALTER TABLE patients OWNER TO postgres;
Here is my abridged query.
SELECT patient_devices.patient_id, serial_number
FROM reads
INNER JOIN devices ON devices.id = reads.device_id
INNER JOIN patient_devices ON patient_devices.device_id = devices.id
WHERE (reads.read_datetime BETWEEN '2012-01-01 10:30:01.000000' AND '2013-05-18 03:03:42')
  AND (read_datetime > issuance_datetime)
  AND ((unassignment_datetime IS NOT NULL AND read_datetime < unassignment_datetime)
       OR (unassignment_datetime IS NULL))
GROUP BY serial_number, patient_devices.patient_id
LIMIT 10
Ultimately, this will be a small part of a larger query (the larger query has no LIMIT; I added the LIMIT here only to convince myself that the long runtime was not tied to the number of rows returned). After a lot of experimenting, I determined that this is the slow part of the larger query. When I run EXPLAIN ANALYZE on this query, I get the following output (it can also be viewed here).
Limit  (cost=156442.31..156442.41 rows=10 width=13) (actual time=2815.435..2815.441 rows=10 loops=1)
  ->  HashAggregate  (cost=156442.31..159114.89 rows=267258 width=13) (actual time=2815.432..2815.437 rows=10 loops=1)
        ->  Hash Join  (cost=1157.78..151455.79 rows=997304 width=13) (actual time=30.930..2739.164 rows=250150 loops=1)
              Hash Cond: (reads.device_id = devices.id)
              Join Filter: ((reads.read_datetime > patient_devices.issuance_datetime) AND (((patient_devices.unassignment_datetime IS NOT NULL) AND (reads.read_datetime < patient_devices.unassignment_datetime)) OR (patient_devices.unassignment_datetime IS NULL)))
              ->  Seq Scan on reads  (cost=0.00..7236.94 rows=255396 width=12) (actual time=0.035..64.433 rows=255450 loops=1)
                    Filter: ((read_datetime >= '2012-01-01 10:30:01'::timestamp without time zone) AND (read_datetime <= '2013-05-18 03:03:42'::timestamp without time zone))
              ->  Hash  (cost=900.78..900.78 rows=20560 width=37) (actual time=30.830..30.830 rows=25015 loops=1)
                    Buckets: 4096  Batches: 1  Memory Usage: 1755kB
                    ->  Hash Join  (cost=19.90..900.78 rows=20560 width=37) (actual time=0.776..20.551 rows=25015 loops=1)
                          Hash Cond: (patient_devices.device_id = devices.id)
                          ->  Seq Scan on patient_devices  (cost=0.00..581.93 rows=24893 width=24) (actual time=0.014..7.867 rows=25545 loops=1)
                                Filter: ((unassignment_datetime IS NOT NULL) OR (unassignment_datetime IS NULL))
                          ->  Hash  (cost=13.61..13.61 rows=503 width=13) (actual time=0.737..0.737 rows=503 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 24kB
                                ->  Seq Scan on devices  (cost=0.00..13.61 rows=503 width=13) (actual time=0.016..0.466 rows=503 loops=1)
                                      Filter: (entity_id = 2)
Total runtime: 2820.392 ms
My question is: how do I speed this up? Right now I am running this on my Windows machine for testing, but eventually it will be deployed to Ubuntu; will that make a difference? Any insight into why this takes over 2 seconds would be greatly appreciated.
thanks
It has been suggested that the LIMIT may be changing the query plan, so here is the same query without the LIMIT. The slow part still appears to be the Hash Join.
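(Reproduced for convenience; this is simply the query above with the final LIMIT clause removed.)

SELECT patient_devices.patient_id, serial_number
FROM reads
INNER JOIN devices ON devices.id = reads.device_id
INNER JOIN patient_devices ON patient_devices.device_id = devices.id
WHERE (reads.read_datetime BETWEEN '2012-01-01 10:30:01.000000' AND '2013-05-18 03:03:42')
  AND (read_datetime > issuance_datetime)
  AND ((unassignment_datetime IS NOT NULL AND read_datetime < unassignment_datetime)
       OR (unassignment_datetime IS NULL))
GROUP BY serial_number, patient_devices.patient_id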
In addition, the relevant settings are listed below. Again, I am only testing on Windows for now, and I don't know what effect this would have on a Linux machine.
shared_buffers = 2GB
effective_cache_size = 4GB
work_mem = 256MB
random_page_cost = 2.0
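Those are the values from postgresql.conf; if it helps, the values the running server is actually using can be double-checked from a psql session, e.g.:

-- Show the effective value of each setting on the running server
SHOW shared_buffers;
SHOW effective_cache_size;
SHOW work_mem;
SHOW random_page_cost;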
Here are the statistics for the reads table
Statistic                    Value
Sequential Scans             130
Sequential Tuples Read       28865850
Index Scans                  283630
Index Tuples Fetched         141421907
Tuples Inserted              255450
Tuples Updated               0
Tuples Deleted               0
Tuples HOT Updated           0
Live Tuples                  255450
Dead Tuples                  0
Heap Blocks Read             20441
Heap Blocks Hit              3493033
Index Blocks Read            8824
Index Blocks Hit             4840210
Toast Blocks Read
Toast Blocks Hit
Toast Index Blocks Read
Toast Index Blocks Hit
Last Vacuum                  2013-05-20 09:23:03.782-07
Last Autovacuum
Last Analyze                 2013-05-20 09:23:03.91-07
Last Autoanalyze             2013-05-17 19:01:44.075-07
Vacuum counter               1
Autovacuum counter           0
Analyze counter              1
Autoanalyze counter          6
Table Size                   27 MB
Toast Table Size             none
Indexes Size                 34 MB
Here are the statistics for the devices table
Statistic                    Value
Sequential Scans             119
Sequential Tuples Read       63336
Index Scans                  1053935
Index Tuples Fetched         1053693
Tuples Inserted              609
Tuples Updated               0
Tuples Deleted               0
Tuples HOT Updated           0
Live Tuples                  609
Dead Tuples                  0
Heap Blocks Read             32
Heap Blocks Hit              1054553
Index Blocks Read            32
Index Blocks Hit             2114305
Toast Blocks Read
Toast Blocks Hit
Toast Index Blocks Read
Toast Index Blocks Hit
Last Vacuum
Last Autovacuum
Last Analyze
Last Autoanalyze             2013-05-17 19:02:49.692-07
Vacuum counter               0
Autovacuum counter           0
Analyze counter              0
Autoanalyze counter          2
Table Size                   48 kB
Toast Table Size             none
Indexes Size                 128 kB
Here are the statistics for the patient_devices table
Statistic                    Value
Sequential Scans             137
Sequential Tuples Read       3065400
Index Scans                  853990
Index Tuples Fetched         46143763
Tuples Inserted              25545
Tuples Updated               24936
Tuples Deleted               0
Tuples HOT Updated           0
Live Tuples                  25547
Dead Tuples                  929
Heap Blocks Read             1959
Heap Blocks Hit              6099617
Index Blocks Read            1077
Index Blocks Hit             2462681
Toast Blocks Read
Toast Blocks Hit
Toast Index Blocks Read
Toast Index Blocks Hit
Last Vacuum
Last Autovacuum              2013-05-17 19:01:44.576-07
Last Analyze
Last Autoanalyze             2013-05-17 19:01:44.697-07
Vacuum counter               0
Autovacuum counter           6
Analyze counter              0
Autoanalyze counter          6
Table Size                   2624 kB
Toast Table Size             none
Indexes Size                 5312 kB
Below is the full query I'm trying to speed up. The smaller query above is indeed faster now, but I have not been able to make the full query, reproduced below, any faster. As suggested, I added 4 new indexes: UNIQUE (device_id, issuance_datetime), UNIQUE (device_id, issuance_datetime), UNIQUE (patient_id, unassignment_datetime), UNIQUE (patient_id, unassignment_datetime).
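Roughly, the DDL for those indexes looks like the sketch below (the index names here are placeholders of my own, and only the two distinct column pairs from the list above are shown):

-- Composite unique indexes on patient_devices; names are illustrative only
CREATE UNIQUE INDEX idx_patient_devices_device_issuance
    ON patient_devices USING btree (device_id, issuance_datetime);
CREATE UNIQUE INDEX idx_patient_devices_patient_unassignment
    ON patient_devices USING btree (patient_id, unassignment_datetime);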
SELECT first_name
     , last_name
     , MAX(max_read) AS read_datetime
     , SUM(value) AS value
     , serial_number
FROM (
    SELECT pa.first_name
         , pa.last_name
         , value
         , first_value(de.serial_number) OVER (PARTITION BY pa.id ORDER BY re.read_datetime DESC) AS serial_number
Sorry for not posting this before; I thought this query would be too complicated, so I tried to pare it down to just the problem part, but apparently I still don't understand it well enough. It seems like it would be much faster if I could limit the results returned by the nested select using the max_read value, but according to numerous sources that is not valid in Postgres.
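To illustrate what I mean, here is a sketch against the reads table only: a window-function alias such as max_read cannot be referenced in the WHERE clause of the same query level, so the filter has to be applied one level up.

-- Not allowed at a single query level:
--   SELECT device_id, read_datetime,
--          MAX(read_datetime) OVER (PARTITION BY device_id) AS max_read
--   FROM reads
--   WHERE read_datetime = max_read;   -- error: max_read is not visible here
-- The usual workaround wraps the window function in a subquery and filters outside:
SELECT *
FROM (
    SELECT device_id,
           read_datetime,
           MAX(read_datetime) OVER (PARTITION BY device_id) AS max_read
    FROM reads
) sub
WHERE read_datetime = max_read;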