How to include missing data for multiple groupings over a period of time?

Question


I have a query below that counts studies grouped by teacher, study month, and room for the past 12 months (including the current month). The result is correct, but I would like to include rows with a count of zero where no data is available.

I looked at several other related posts, but could not get the desired result:

Postgres - how to return rows with 0 count for missing data?
Month of the month postgresql with missing values
Best way to count records at arbitrary time intervals in Rails + Postgres

Here is the query:

    SELECT upper(trim(t.full_name)) AS teacher
         , date_trunc('month', s.study_dt)::date AS study_month
         , r.room_code AS room
         , COUNT(1) AS study_count
    FROM studies AS s
    LEFT OUTER JOIN rooms AS r ON r.id = s.room_id
    LEFT OUTER JOIN teacher_contacts AS tc ON tc.id = s.teacher_contact_id
    LEFT OUTER JOIN teachers AS t ON t.id = tc.teacher_id
    WHERE s.study_dt BETWEEN now() - interval '13 month' AND now()
      AND s.study_dt IS NOT NULL
    GROUP BY teacher, study_month, room
    ORDER BY teacher, study_month, room;

The output I get is:

    "teacher","study_month","room","study_count"
    "DOE, JOHN","2015-07-01","A1",1
    "DOE, JOHN","2015-12-01","A2",1
    "DOE, JOHN","2016-01-01","B1",1
    "SIMPSON, HOMER","2016-05-01","B2",3
    "MOUSE, MICKEY","2015-08-01","A2",1
    "MOUSE, MICKEY","2015-11-01","B1",1
    "MOUSE, MICKEY","2015-11-01","B2",2

But I want the number 0 to be displayed for all missing year-month and room combinations. For example (only the first lines, only 4 rooms: A1, A2, B1, B2):

    "teacher","study_month","room","study_count"
    "DOE, JOHN","2015-07-01","A1",1
    "DOE, JOHN","2015-07-01","A2",0
    "DOE, JOHN","2015-07-01","B1",0
    "DOE, JOHN","2015-07-01","B2",0
    ...
    "DOE, JOHN","2015-12-01","A1",1
    "DOE, JOHN","2015-12-01","A2",0
    "DOE, JOHN","2015-12-01","B1",0
    "DOE, JOHN","2015-12-01","B2",0
    ...

To get the missing months, I tried a LEFT OUTER JOIN against a generated time series, joining on time_range.year_month = study_month, but that didn't work.

    SELECT date_trunc('month', time_range)::date AS year_month
    FROM generate_series(now() - interval '13 month', now(), '1 month') AS time_range

So I would like to know how to "fill in the gaps" for:

a) both year-month and room, and, as a bonus,
b) year-month only.

The reason for this is that the data set will be fed to a pivot library so that we can get a result similar to the following (I could not do it directly in PG):

    teacher,room,2015-07,...,2015-12,...,2016-07,total
    "DOE, JOHN",A1,1,...,1,...,0,2
    "DOE, JOHN",A2,0,...,0,...,0,0
    ...and so on...

sql aggregate-functions group-by left-join postgresql

zam6ak Jul 12 '16 at 14:57

2 answers

You need to generate all rows using cross join, then left join studies and aggregate to get the counts.

The resulting query should look like this:

    select t.full_name, d.mon, r.room_code, count(s.teacher_contact_id)
    from teachers t cross join
         rooms r cross join
         generate_series(date_trunc('month', now() - interval '13 month'),
                         date_trunc('month', now()),
                         interval '1 month'
                        ) d(mon) left join
         teacher_contacts tc
         on tc.teacher_id = t.id left join
         studies s
         on tc.id = s.teacher_contact_id and
            s.room_id = r.id and  -- match the room too, or each study is counted once per room
            date_trunc('month', s.study_dt) = d.mon
    group by t.full_name, d.mon, r.room_code;

Gordon Linoff Jul 12 '16 at 15:06

Erwin Brandstetter · Accepted Answer · 2016-07-12 15:46

Based on some assumptions (ambiguities in the question), I suggest:

    SELECT upper(trim(t.full_name)) AS teacher
         , m.study_month
         , r.room_code AS room
         , count(s.room_id) AS study_count
    FROM   teachers t
    CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                              , date_trunc('month', now())
                              , interval '1 month') m(study_month)
    CROSS  JOIN rooms r
    LEFT   JOIN (                                                  -- parentheses!
              studies s
       JOIN   teacher_contacts tc ON tc.id = s.teacher_contact_id  -- INNER JOIN!
       ) ON tc.teacher_id = t.id
        AND s.study_dt >= m.study_month
        AND s.study_dt <  m.study_month + interval '1 month'       -- sargable!
        AND s.room_id = r.id
    GROUP  BY t.id, m.study_month, r.id  -- id is PK of respective tables
    ORDER  BY t.id, m.study_month, r.id;

Highlights

Build a grid of all desired combinations with CROSS JOIN, then LEFT JOIN to the existing rows. Related:
- array_agg group by and null
- Get created as well as deleted records of last week
In your case, the existing rows come from a join of several tables, so I use parentheses in the FROM list to LEFT JOIN to the result of the INNER JOIN inside the parentheses. It would be incorrect to LEFT JOIN to each table separately, because that would include hits on partial matches and produce potentially incorrect counts.
Assuming referential integrity and working with PK columns directly, we do not need to join rooms and teachers a second time inside the parentheses. But we still have a join between two tables (studies and teacher_contacts). The role of teacher_contacts is unclear to me. Normally, I would expect a direct relationship between studies and teachers. It could be simpler still ...
We need to count a non-null column from the nullable side of the LEFT JOIN (the joined studies rows) to get the desired counts, like count(s.room_id).
To keep this fast for big tables, make sure your predicates are sargable, and add matching indexes.
The teacher column is hardly (reliably) unique. Operate with a unique ID, preferably the PK (faster and simpler, too). I am still using teacher in the output to match your desired result. It might be wise to include a unique ID as well, since names can be duplicated.
You want:
for the past 12 months (including the current month).
So start with date_trunc('month', now() - interval '12 month') (not 13). Rounding down the start this way already does what you want, more precisely than your original query.
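To illustrate the point about counting a column from the joined table rather than counting rows, here is a minimal, self-contained sketch. The two VALUES lists are made-up stand-ins for rooms and studies, not the real tables:

```sql
-- Made-up data: two rooms, one study in room 1, none in room 2.
SELECT r.room_code
     , count(*)         AS count_star    -- also counts the NULL-extended row
     , count(s.room_id) AS count_column  -- ignores NULL, so unmatched rooms get 0
FROM  (VALUES (1, 'A1'), (2, 'A2')) r(id, room_code)
LEFT   JOIN (VALUES (1)) s(room_id) ON s.room_id = r.id
GROUP  BY r.room_code
ORDER  BY r.room_code;
```

For room A2 the LEFT JOIN produces one row with s.room_id set to NULL, so count(*) reports 1 while count(s.room_id) reports 0, which is the value wanted for a missing combination.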

Since you mentioned slow performance: depending on the actual table definitions and data distribution, it is probably faster to aggregate first and join later, as in this related answer:

Postgres - how to return rows with 0 count for missing data?

    SELECT upper(trim(t.full_name)) AS teacher
         , m.mon AS study_month
         , r.room_code AS room
         , COALESCE(s.ct, 0) AS study_count
    FROM   teachers t
    CROSS  JOIN generate_series(date_trunc('month', now() - interval '12 month')  -- 12!
                              , date_trunc('month', now())
                              , interval '1 month') m(mon)  -- table alias m, column alias mon
    CROSS  JOIN rooms r
    LEFT   JOIN (                                           -- parentheses!
       SELECT tc.teacher_id, date_trunc('month', s.study_dt) AS mon, s.room_id, count(*) AS ct
       FROM   studies s
       JOIN   teacher_contacts tc ON s.teacher_contact_id = tc.id
       WHERE  s.study_dt >= date_trunc('month', now() - interval '12 month')  -- sargable
       GROUP  BY 1, 2, 3
       ) s ON s.teacher_id = t.id
          AND s.mon = m.mon
          AND s.room_id = r.id
    ORDER  BY 1, 2, 3;
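As a rough sketch of what "sargable" means for the date filter on studies: a predicate that compares the bare column against computed bounds can use a plain index on that column, while one that wraps the column in a function generally cannot. The index name and the choice of indexed column below are assumptions for illustration, not something stated in the question:

```sql
-- Hypothetical index (name and column choice are assumptions).
CREATE INDEX IF NOT EXISTS studies_study_dt_idx ON studies (study_dt);

-- Sargable: the bare column is compared against computed bounds,
-- so the planner can consider a range scan on the index above.
SELECT count(*)
FROM   studies s
WHERE  s.study_dt >= date_trunc('month', now())
AND    s.study_dt <  date_trunc('month', now()) + interval '1 month';

-- Not sargable for that index: the column is wrapped in a function,
-- so the expression has to be evaluated for every row.
SELECT count(*)
FROM   studies s
WHERE  date_trunc('month', s.study_dt) = date_trunc('month', now());
```

Whether the planner actually picks the index depends on table size and data distribution; EXPLAIN on the real tables is the way to check.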

About your final remark:

the data set will be fed to a pivot library ... (could not do it directly in PG)

Most likely, you can use the two-parameter form of crosstab() to produce your desired result directly and with excellent performance, without needing this query to begin with. Consider:

PostgreSQL Crosstab Query
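A minimal sketch of the two-parameter form, assuming the tablefunc extension can be installed. The inline VALUES rows are three rows copied from the question's output with the month shortened to YYYY-MM; the row_name expression and the fixed list of month columns are assumptions for illustration:

```sql
CREATE EXTENSION IF NOT EXISTS tablefunc;  -- provides crosstab()

SELECT *
FROM   crosstab(
   -- 1st parameter: row_name, extra columns, then category and value last
   $$
   SELECT teacher || '/' || room AS row_name, teacher, room, study_month, study_count
   FROM  (VALUES
            ('DOE, JOHN', 'A1', '2015-07', 1)
          , ('DOE, JOHN', 'A2', '2015-12', 1)
          , ('DOE, JOHN', 'B1', '2016-01', 1)
         ) v(teacher, room, study_month, study_count)
   ORDER  BY 1
   $$
   -- 2nd parameter: the full list of categories, one output column each
 , $$VALUES ('2015-07'), ('2015-12'), ('2016-01')$$
   ) AS ct (row_name text, teacher text, room text
          , "2015-07" int, "2015-12" int, "2016-01" int);
```

Month and row combinations without data come back as NULL rather than 0, so wrap the month columns in COALESCE in the outer SELECT if zeros are required. The column definition list is static, so a rolling 12-month window needs the statement to be built dynamically.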
