How to store sets of objects that happened together during events?

Question

I'm looking for an efficient way to store sets of objects that happened together during events, so that I can generate aggregate statistics on them on a daily basis.

To give an example, imagine a system that tracks meetings in an office. For each meeting, we record who attended, how many minutes it lasted, and which room it took place in.

I want to get statistics broken down by both person and room. I don’t need to keep track of individual meetings (so there is no meeting_id or anything like that); all I want is the daily summary. My real application has hundreds of thousands of events per day, so storing each of them individually is not feasible.

I would like to answer questions such as:

In 2012, how many minutes did Bob, Sam and Julie spend in each conference room (not necessarily together)?

This could probably be done with 3 queries:

    >>> query(dates=2012, people=[Bob])
    {Board-Room: 35, Auditorium: 279}
    >>> query(dates=2012, people=[Sam])
    {Board-Room: 790, Auditorium: 277, Broom-Closet: 71}
    >>> query(dates=2012, people=[Julie])
    {Board-Room: 190, Broom-Closet: 55}

In 2012, how many minutes did Sam and Julie spend in a MEETING TOGETHER in each conference room? What about Bob, Sam, and Julie?

    >>> query(dates=2012, people=[Sam, Julie])
    {Board-Room: 128, Broom-Closet: 55}
    >>> query(dates=2012, people=[Bob, Sam, Julie])
    {Board-Room: 22}

In 2012, how many minutes did each person spend in the Board-Room?

    >>> query(dates=2012, rooms=[Board-Room])
    {Bob: 35, Sam: 790, Julie: 190}

In 2012, how many minutes were used in the Board-Room?

This is actually somewhat tricky, since the naive strategy of summing the minutes spent by each person leads to serious overcounting. But we can probably solve this by storing the number separately under an Anyone meta-person:

    >>> query(dates=2012, rooms=[Board-Room], people=[Anyone])
    865
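To make the overcounting concrete, here is a small check in Python using only the Board-Room numbers from the example queries above. The variable names are made up for illustration.

```python
# Per-person Board-Room minutes, copied from the example query above.
per_person = {"Bob": 35, "Sam": 790, "Julie": 190}

# Naive total: every shared meeting is counted once per attendee.
naive_total = sum(per_person.values())

# Value stored under the "Anyone" meta-person in the example.
room_total = 865

print(naive_total)               # 1015
print(naive_total - room_total)  # 150 minutes were counted more than once
```

The gap between the two numbers is exactly the time contributed by second and third attendees of shared meetings, which is why the room total has to be stored or computed separately.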

What are some good data structures or databases that I can use to support this kind of query? Since the rest of my application uses MySQL, I am tempted to define a string column containing the (sorted) identifiers of each person in the meeting, but the size of this table will grow quite quickly:

    2012-01-01 | "Bob"           | "Board-Room" | 2
    2012-01-01 | "Julie"         | "Board-Room" | 4
    2012-01-01 | "Sam"           | "Board-Room" | 6
    2012-01-01 | "Bob,Julie"     | "Board-Room" | 2
    2012-01-01 | "Bob,Sam"       | "Board-Room" | 2
    2012-01-01 | "Julie,Sam"     | "Board-Room" | 3
    2012-01-01 | "Bob,Julie,Sam" | "Board-Room" | 2
    2012-01-01 | "Anyone"        | "Board-Room" | 7
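A rough sketch of how such a table could be filled, and how fast it grows, is shown below. The four meetings are an assumption: they are not given in the question, only chosen so that the totals match the table above. The helper names `subset_keys` and `daily_summary` are hypothetical.

```python
from collections import Counter
from itertools import combinations

def subset_keys(attendees):
    """Return the sorted, comma-joined key of every non-empty subset of attendees."""
    people = sorted(attendees)
    return [",".join(combo)
            for size in range(1, len(people) + 1)
            for combo in combinations(people, size)]

def daily_summary(meetings):
    """Add each meeting's minutes to every subset of its attendees, plus 'Anyone'."""
    totals = Counter()
    for day, room, attendees, minutes in meetings:
        for key in subset_keys(attendees):
            totals[(day, key, room)] += minutes
        totals[(day, "Anyone", room)] += minutes
    return totals

# Assumed raw meetings, reconstructed to be consistent with the table above.
meetings = [
    ("2012-01-01", "Board-Room", {"Bob", "Julie", "Sam"}, 2),
    ("2012-01-01", "Board-Room", {"Julie", "Sam"}, 1),
    ("2012-01-01", "Board-Room", {"Julie"}, 1),
    ("2012-01-01", "Board-Room", {"Sam"}, 3),
]
summary = daily_summary(meetings)
for (day, key, room), minutes in sorted(summary.items()):
    print(day, key, room, minutes)

# A single meeting with n attendees touches 2**n - 1 subset rows.
print(len(subset_keys([f"p{i}" for i in range(10)])))  # 1023
```

The last line is the real concern: the number of rows touched per meeting doubles with every additional attendee, so this layout is only workable when meetings are small.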

What else can I do?

database data-structures database-design analytics

Rob crowell Aug 2 '13 at 17:03

5 answers

Shawn · Answer 1 · 2013-08-06T15:01:30+0000

Your question is a bit unclear: you say you don’t want to keep each individual meeting, but then how do you get the statistics for a given date? Besides, any table with the proper indexes can be very fast even with a large number of records.

You should be able to use a table like log_meeting. I assume this may contain something like:

 employee_id, room_id, date (as timestamp), time_in_meeting

Here employee_id would be a foreign key into the employee table and room_id a foreign key into the room table.

If you index employee id, room id and date together, lookups should be pretty quick. MySQL uses multiple-column indexes from left to right, so a single index on (employee id, room id, date) can serve searches on employee id alone, on employee id + room id, and on employee id + room id + date. This is explained in more detail in the multiple-column index part of:

http://dev.mysql.com/doc/refman/5.0/en/mysql-indexes.html
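As a minimal sketch of this layout, the following uses Python's built-in sqlite3 module as a stand-in for MySQL, so that it runs without a server. The column names follow the answer; the index name `idx_emp_room_date`, the helper `minutes_per_room`, and the sample rows are assumptions made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE log_meeting (
    employee_id     INTEGER NOT NULL,
    room_id         INTEGER NOT NULL,
    date            TEXT    NOT NULL,  -- ISO date, e.g. 2012-01-01
    time_in_meeting INTEGER NOT NULL   -- minutes
);
-- One composite index; the leftmost columns can be used on their own.
CREATE INDEX idx_emp_room_date ON log_meeting (employee_id, room_id, date);
""")

# Made-up sample data: (employee_id, room_id, date, minutes).
rows = [
    (1, 10, "2012-01-01", 20),
    (1, 10, "2012-03-15", 15),
    (1, 20, "2012-06-30", 30),
    (2, 10, "2012-01-01", 20),
    (1, 10, "2013-01-05", 99),  # outside 2012, must be ignored
]
conn.executemany("INSERT INTO log_meeting VALUES (?, ?, ?, ?)", rows)

def minutes_per_room(conn, employee_id, year):
    """Total minutes per room for one employee within one calendar year."""
    cur = conn.execute(
        """SELECT room_id, SUM(time_in_meeting)
           FROM log_meeting
           WHERE employee_id = ? AND date >= ? AND date < ?
           GROUP BY room_id""",
        (employee_id, f"{year}-01-01", f"{year + 1}-01-01"),
    )
    return dict(cur.fetchall())

print(minutes_per_room(conn, 1, 2012))  # {10: 35, 20: 30}
```

Note that this table stores one row per attendee, so on its own it answers the per-person and per-room questions but not the "met together" ones; those need some way to tie the attendees of the same meeting to each other.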

Philippe grondier · Answer 2 · 2013-08-09T14:18:20+0000

By refusing to store events (and their related objects) individually, you lose the original source of information.

You will not be able to compensate for this loss of data unless you regularly compute an exhaustive list of every daily (or weekly, or monthly, ...) aggregate that you may need later!

Believe me, it will be a nightmare ...

puneet · Answer 3 · 2013-08-11T15:55:00+0000

If the number of people is constant and not very large, you can give each person a column that records whether they were present, and store the room, date and time in three more columns. This avoids the problems of splitting strings.

Also, given the nature of your question, I feel that first of all you need to assign identifiers to all rooms, people, etc., so that nothing has to be repeated at length in the database. Try to avoid string operations as well, and keep separate data in each column for better intersection performance. You could also store every combination of people in a table, assign each an identifier, and then use those identifiers in the actual date and time table. But all of these methods require that something be constant, either the people or the rooms.
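One compact way to picture the column-per-person idea is a bitmask with one bit per person; this is a substitute for separate columns, not what the answer literally proposes. The sketch below assumes a fixed roster, and the names `PEOPLE`, `mask_of` and `minutes_together`, as well as the sample rows, are hypothetical.

```python
# Assumed fixed roster: one bit per person, in place of one column per person.
PEOPLE = ["Bob", "Julie", "Sam"]
BIT = {name: 1 << i for i, name in enumerate(PEOPLE)}

def mask_of(names):
    """Combine the bits of all given people into one integer."""
    mask = 0
    for name in names:
        mask |= BIT[name]
    return mask

# Made-up rows: (date, room, attendee mask, minutes).
rows = [
    ("2012-01-01", "Board-Room", mask_of(["Bob", "Julie", "Sam"]), 2),
    ("2012-01-01", "Board-Room", mask_of(["Julie", "Sam"]), 1),
    ("2012-01-01", "Board-Room", mask_of(["Julie"]), 1),
    ("2012-01-01", "Board-Room", mask_of(["Sam"]), 3),
]

def minutes_together(rows, names):
    """Minutes per room over rows whose attendees include everyone in names."""
    want = mask_of(names)
    totals = {}
    for _day, room, mask, minutes in rows:
        if (mask & want) == want:  # every requested bit is set
            totals[room] = totals.get(room, 0) + minutes
    return totals

print(minutes_together(rows, ["Julie", "Sam"]))  # {'Board-Room': 3}
print(minutes_together(rows, []))                # {'Board-Room': 7}
```

Passing an empty list matches every row, which gives the per-room total without any overcounting. As the answer notes, the approach breaks down if the roster changes or grows large.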

Martin podval · Answer 4 · 2013-08-12T06:37:00+0000

I don’t understand whether you know all the “questions” at development time, or whether new ones can be added later during development or production; the latter would require permanent storage of all the data.

If you do know all your questions up front, this looks like a classic "banking system" that recomputes its data daily.

Here is how I think about it:

It looks like you have a limited number of rooms, people, days, etc.
Collect the raw log data daily, one table per day. One event is one database row, with all the fields you need.
Start analyzing the data with a cron script at midnight.
Update the statistics for people, rooms, etc. Just increase the number of minutes Bob spent in room xyz, and so on. Anything you need.
Since the analyzed (compressed) data is limited and relatively small, your system can also serve a variety of queries, because the indexes will stay relatively small.

If you need this to scale, you can use a map/reduce algorithm.
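The nightly roll-up described in these steps might look roughly like the sketch below. It keeps the statistics in in-memory counters purely for illustration; in practice the cron job would update MySQL tables instead. The function name `roll_up` and the sample events are assumptions.

```python
from collections import Counter

def roll_up(day, events, person_room, room_total):
    """Fold one day's raw events into the running statistics."""
    for room, attendees, minutes in events:
        room_total[(day, room)] += minutes
        for person in attendees:
            person_room[(day, person, room)] += minutes

person_room = Counter()  # (day, person, room) -> minutes
room_total = Counter()   # (day, room) -> minutes

# Made-up raw events for one day: (room, attendees, minutes).
todays_events = [
    ("Board-Room", ["Bob", "Julie", "Sam"], 2),
    ("Board-Room", ["Julie", "Sam"], 1),
    ("Broom-Closet", ["Julie"], 5),
]
roll_up("2012-01-01", todays_events, person_room, room_total)
todays_events.clear()  # the raw rows can be discarded after the roll-up

print(person_room[("2012-01-01", "Julie", "Board-Room")])  # 3
print(room_total[("2012-01-01", "Board-Room")])            # 3
```

This only works for questions that are known in advance: once the raw events are gone, a new kind of statistic (for example, minutes for an arbitrary group of people) cannot be derived from these counters.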

emperorz · Answer 5 · 2013-09-04T16:50:26+0000

You cannot avoid storing atomic facts of the form (conference room, people, duration, day). That is probably only a weak consolidation, since rows merge only when the same people meet several times in the same room on the same day. Perhaps that does happen in your office :).

Matching groups is an interesting problem, but as long as you always compose the member strings the same way, you can probably do it with string comparisons. However, this is not "normalized". To normalize, you will need a many-to-many relationship table. Then either load the queried set of people into a temporary table so that it joins quickly, or use an "IN" clause together with a count to make sure everyone is there (you will see what I mean when you try).

I think you can work out the minutes a meeting room was in use, since meetings in one room should not overlap, so a plain sum will work.

For storage efficiency, use integer keys for everything, with lookup tables. Resolve names to integers while parsing the query, or just use good old joins if you feel traditional.

That is how I would do it :).
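A sketch of the normalized variant with an "IN" clause and a count follows, again using Python's sqlite3 module in place of MySQL so that it runs standalone. The table names `fact` and `fact_person`, the helper `minutes_together`, and the sample rows are assumptions; people are stored as names here only for readability, where the answer would use integer keys.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE fact (
    fact_id INTEGER PRIMARY KEY,
    day     TEXT    NOT NULL,
    room    TEXT    NOT NULL,
    minutes INTEGER NOT NULL
);
-- Many-to-many link; the primary key guarantees one row per (fact, person).
CREATE TABLE fact_person (
    fact_id INTEGER NOT NULL REFERENCES fact(fact_id),
    person  TEXT    NOT NULL,
    PRIMARY KEY (fact_id, person)
);
""")

# Made-up atomic facts: (fact_id, day, room, minutes, people).
facts = [
    (1, "2012-01-01", "Board-Room", 2, ["Bob", "Julie", "Sam"]),
    (2, "2012-01-01", "Board-Room", 1, ["Julie", "Sam"]),
    (3, "2012-01-01", "Board-Room", 3, ["Sam"]),
    (4, "2012-01-02", "Broom-Closet", 5, ["Julie", "Sam"]),
]
for fact_id, day, room, minutes, people in facts:
    conn.execute("INSERT INTO fact VALUES (?, ?, ?, ?)", (fact_id, day, room, minutes))
    conn.executemany("INSERT INTO fact_person VALUES (?, ?)",
                     [(fact_id, p) for p in people])

def minutes_together(conn, people):
    """Minutes per room for facts that include every person in people."""
    people = sorted(set(people))
    marks = ",".join("?" * len(people))
    sql = f"""
        SELECT f.room, SUM(f.minutes)
        FROM fact AS f
        JOIN (
            SELECT fact_id
            FROM fact_person
            WHERE person IN ({marks})
            GROUP BY fact_id
            HAVING COUNT(*) = ?      -- all requested people are present
        ) AS hit ON hit.fact_id = f.fact_id
        GROUP BY f.room
    """
    return dict(conn.execute(sql, (*people, len(people))).fetchall())

print(minutes_together(conn, ["Julie", "Sam"]))
print(minutes_together(conn, ["Bob", "Julie", "Sam"]))
```

The count check relies on each (fact, person) pair appearing at most once; without that uniqueness constraint, a duplicated row could make a partial match look complete.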
