I am trying to find a theoretical solution to an NxN problem for data aggregation and storage. As an example, I have a huge amount of data that arrives through a stream. The stream sends the data as points. Each point has 5 dimensions:
- Location
- Date
- Time
- Name
- Statistics
This data should then be aggregated and stored so that another user can come along and query it by both location and time. The user should be able to request the following (pseudo-code):
Show me the aggregated statistics for locations 1, 2, 3, 4, ..., N between the dates 01/01/2011 and 01/03/2011, between 11:00 and 16:00
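To make the shape of the data and the query concrete, here is a minimal sketch; the field names, and the assumption that the statistics are named counters, are purely illustrative:

```python
from dataclasses import dataclass
from datetime import date, time

@dataclass
class Point:
    location: int    # dimension 1
    day: date        # dimension 2
    tod: time        # dimension 3 (time of day)
    name: str        # dimension 4
    stats: dict      # dimension 5, e.g. {"views": 10, "clicks": 2}

# The query above, expressed as parameters:
# locations=[1, 2, 3, 4], date_from=date(2011, 1, 1), date_to=date(2011, 3, 1),
# time_from=time(11, 0), time_to=time(16, 0)
```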
Unfortunately, due to the scale of the data, it is impossible to aggregate the raw points on the fly, so the aggregation must be done beforehand. As you can see, there are several dimensions across which the data could be aggregated.
Users can request any number of days or locations, so covering every combination would require an enormous amount of pre-aggregation:
- Record for location 1, today
- Record for locations 1,2, today
- Record for locations 1,3, today
- Record for locations 1,2,3, today
- etc., up to N
Pre-computing all of these combinations before any request arrives results in an infeasible number of aggregates. If we have 200 different locations, there are 2^200 combinations, which would be practically impossible to precompute in any reasonable amount of time.
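To put a number on that claim: every non-empty subset of the locations would be a distinct combination to pre-aggregate:

```python
# Number of non-empty subsets of 200 locations.
n_locations = 200
combinations = 2 ** n_locations - 1
print(f"{combinations:.3e}")  # ~1.607e+60 distinct location sets
```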
I was thinking about creating pre-aggregated records for a single dimension, and then doing the merge on the fly, on demand, but the merge itself also takes time and does not scale (a sketch of what I mean is below).
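Here is a minimal sketch of that single-dimension idea, assuming for illustration that the statistics are additive counters keyed by name: build one bucket per (location, day, hour), then merge the relevant buckets at query time. The scan in `query` is exactly the part that takes time:

```python
from collections import Counter, defaultdict
from datetime import date

# One pre-aggregated bucket per (location, day, hour); building these is
# linear in the number of points, not exponential in location combinations.
buckets = defaultdict(Counter)  # (location, day, hour) -> {name: count}

def ingest(location, day, hour, name, count):
    buckets[(location, day, hour)][name] += count

def query(locations, date_from, date_to, hour_from, hour_to):
    # On-demand merge: sum every matching single-location bucket.
    # This scan is the part that does not scale for large ranges.
    wanted = set(locations)
    result = Counter()
    for (loc, day, hour), stats in buckets.items():
        if (loc in wanted
                and date_from <= day <= date_to
                and hour_from <= hour <= hour_to):
            result.update(stats)
    return result

ingest(1, date(2011, 1, 1), 11, "foo", 10)
ingest(2, date(2011, 2, 15), 14, "foo", 5)
print(query([1, 2], date(2011, 1, 1), date(2011, 3, 1), 11, 16))
# Counter({'foo': 15})
```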
Questions:
- How do I choose the right dimension and/or combination of dimensions to pre-aggregate, given that the user can query across all of them?
- Are there any case studies that I could refer to, books that I could read, or anything else that you might think would help?
Thank you for your time.
EDIT 1
When I say combining data, I mean combining the statistics and names (dimensions 4 and 5) across the other dimensions. So, for example, if I request data for locations 1,2,3,4..N, I have to combine the statistics and name counts for those N locations before serving the result to the user.
Similarly, if I request data for the dates 01/01/2015 - 01/12/2015, I have to combine all the data within that period (by summing the name counts and statistics).
Finally, if I request data between the dates 01/01/2015 - 01/12/2015 for locations 1,2,3,4..N, I must combine all the data between those dates for all of those locations.
For this example, merging the statistics takes some kind of nested loop, which does not scale very well, especially when done on the fly.
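Roughly, the merge I end up doing looks like this (the record layout here is just for illustration, with stats as additive name-to-count maps):

```python
from collections import Counter
from datetime import date

# records: {location: {day: Counter({name: count})}}  (illustrative layout)
def merge(records, locations, date_from, date_to):
    combined = Counter()
    for loc in locations:                         # outer loop: N locations
        for day, stats in records[loc].items():   # inner loop: days on record
            if date_from <= day <= date_to:
                combined.update(stats)            # sum the name counts
    return combined

records = {1: {date(2015, 1, 1): Counter(foo=3)},
           2: {date(2015, 2, 1): Counter(foo=5, bar=1)}}
print(merge(records, [1, 2], date(2015, 1, 1), date(2015, 12, 1)))
# Counter({'foo': 8, 'bar': 1})
```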