Data Warehouse Design for NxN Aggregation

I am trying to find a theoretical solution to the NxN problem for data aggregation and storage. As an example, I have a huge amount of data that arrives through a stream. The stream sends the data as points, and each point has 5 dimensions:

  • Location
  • Date
  • Time
  • Name
  • Statistics

This data should then be aggregated and stored so that another user can come and request it both by location and by time. The user should be able to make a request like the following (pseudo-code):

Show me the aggregated statistics for locations 1,2,3,4,...,N between the dates 01/01/2011 and 01/03/2011, between 11:00 and 16:00
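
To make this concrete, the request might look roughly like the following in SQL; the fact table point_stats and its columns are purely illustrative, not part of any existing design:

 SELECT location_id, name,
        COUNT(*)       AS name_count,
        SUM(statistic) AS stat_total
 FROM point_stats
 WHERE location_id IN (1, 2, 3, 4)                                   -- ... up to N locations
   AND event_date BETWEEN DATE '2011-01-01' AND DATE '2011-03-01'    -- reading 01/03/2011 as 1 March
   AND event_time BETWEEN TIME '11:00:00' AND TIME '16:00:00'
 GROUP BY location_id, name;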

Unfortunately, due to the scale of the data, it is impossible to aggregate the raw points on the fly, so the aggregation must be done in advance. As you can see, there are several dimensions across which the data could be aggregated.

A user can request any number of days or locations, so covering all combinations would require an enormous amount of pre-aggregation:

  • Record for location 1, today
  • Record for locations 1,2, today
  • Record for locations 1,3, today
  • Record for locations 1,2,3, today
  • etc. up to N

Pre-computing all of these combinations ahead of time results in an unmanageable number of pre-aggregations. If we have 200 different locations, there are 2^200 (roughly 1.6 × 10^60) combinations, which would be practically impossible to precompute in any reasonable amount of time.

I was thinking about creating pre-aggregated records along a single dimension and then merging them on the fly on demand, but that also takes too long at scale.
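
Roughly what I had in mind (the table and column names here are purely illustrative): keep one pre-aggregated row per location per day per hour, and let the query sum over whatever locations and dates the user asks for. This avoids enumerating location subsets, at the cost of a fan-in at query time:

 -- One pre-aggregated row per (location, day, hour, name) instead of one row
 -- per subset of locations: the row count grows linearly with the number of
 -- locations, not as 2^N.
 CREATE TABLE location_hour_rollup (
     location_id  INTEGER          NOT NULL,
     event_date   DATE             NOT NULL,
     event_hour   SMALLINT         NOT NULL,   -- 0..23
     name         VARCHAR(100)     NOT NULL,
     point_count  BIGINT           NOT NULL,   -- raw points folded into this row
     stat_total   DOUBLE PRECISION NOT NULL,   -- summed statistics
     PRIMARY KEY (location_id, event_date, event_hour, name)
 );

 -- The on-the-fly merge for an arbitrary set of locations and a date/time range:
 SELECT name, SUM(point_count) AS name_count, SUM(stat_total) AS stat_total
 FROM location_hour_rollup
 WHERE location_id IN (1, 2, 3)
   AND event_date BETWEEN DATE '2011-01-01' AND DATE '2011-03-01'
   AND event_hour BETWEEN 11 AND 16
 GROUP BY name;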

Questions:

  • How do I choose the right dimension and/or combination of dimensions, given that the user can query by any of them?
  • Are there any case studies that I could refer to, books that I could read, or anything else that you might think would help?

Thank you for your time.

EDIT 1

When I say combining data, I mean combining the statistics and name (dimensions 4 and 5) across the other dimensions. So, for example, if I request data for locations 1,2,3,4..N, then I have to combine the statistics and name counts for those N locations before serving the result to the user.

Similarly, if I request data for the dates 01/01/2015 - 01/12/2015, then I have to combine all the data in that period (by summing the name counts / statistics).

Finally, if I request data between the dates 01/01/2015 - 01/12/2015 for locations 1,2,3,4..N, then I must combine all the data between these dates for all these locations.

For this example, the combine over the statistics requires some kind of nested loop and does not scale very well, especially on the fly.
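
For illustration only (reusing the hypothetical point_stats and location_hour_rollup tables sketched above), the pre-aggregation step that replaces that nested loop might look roughly like this:

 -- Fold the raw points into the rollup once, so that a per-request combine
 -- only touches already-reduced rows rather than the raw points.
 INSERT INTO location_hour_rollup
     (location_id, event_date, event_hour, name, point_count, stat_total)
 SELECT location_id,
        event_date,
        CAST(EXTRACT(HOUR FROM event_time) AS SMALLINT),
        name,
        COUNT(*),
        SUM(statistic)
 FROM point_stats
 GROUP BY location_id, event_date, EXTRACT(HOUR FROM event_time), name;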

+5
7 answers

Denormalization is a means of solving performance or scalability problems in a relational database.
IMO, creating some new tables to store the aggregated data and using them for reporting will help you.

I have a huge amount of data that arrives through a stream. The stream sends the data as points.

There are several ways to achieve denormalization in your case:

  • Adding a new, parallel endpoint for the data-aggregation function at the streaming level
  • Scheduling jobs that aggregate the data at the DBMS level
  • Using the DBMS trigger mechanism (less efficient)

Ideally, when a message reaches the streaming layer, two copies of the data message containing the location, date, time, name, and statistics dimensions are sent for processing: one goes to OLTP (the current application logic), the other goes to the OLAP (BI) process.
The BI process builds denormalized, aggregated reporting structures.
I would suggest aggregated records of the data grouped by location and date.
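
As a hedged illustration (the table and column names are my own, and point_stats is the hypothetical raw table from the question), such a reporting structure and the scheduled job that fills it might look roughly like this:

 -- BI-side reporting table, denormalized and grouped by location and date.
 CREATE TABLE report_location_day (
     location_id INTEGER          NOT NULL,
     event_date  DATE             NOT NULL,
     name        VARCHAR(100)     NOT NULL,
     point_count BIGINT           NOT NULL,
     stat_total  DOUBLE PRECISION NOT NULL,
     PRIMARY KEY (location_id, event_date, name)
 );

 -- Scheduled (e.g. nightly) aggregation job at the DBMS level:
 INSERT INTO report_location_day (location_id, event_date, name, point_count, stat_total)
 SELECT location_id, event_date, name, COUNT(*), SUM(statistic)
 FROM point_stats
 WHERE event_date = CURRENT_DATE - INTERVAL '1' DAY
 GROUP BY location_id, event_date, name;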

Thus, the end user will query pre-aggregated data that does not require heavy recalculation, with some acceptable error.

How do I choose the right dimension and/or combination of dimensions, given that the user can query by any of them?

It depends on your application logic. If possible, restrict the user to predefined queries whose values the user can fill in (for example, dates from 01/01/2015 to 12/01/2015). For more complex systems, you can use a report generator on top of the BI store.
I would recommend Kimball's The Data Warehouse ETL Toolkit.

+1

Try a time series database!

From your description, it seems that your data is a set of time series. It also seems that the user mainly cares about the time range of a query, and that after choosing a time interval the user refines the results with additional conditions.

With this in mind, I suggest you try a time series database such as InfluxDB or OpenTSDB. For example, InfluxDB provides a query language that can handle queries like the following, which is pretty close to what you are trying to achieve:

 SELECT count(location) FROM events
 WHERE time > '2013-08-12 22:32:01.232' AND time < '2013-08-13'
 GROUP BY time(10m);

I'm not sure what scale you mean, but time series databases were designed to ingest large numbers of data points quickly. I definitely suggest giving them a try before rolling your own solution!

+2

You can at least collapse the date and time into a single dimension and pre-aggregate your data at your minimum level of detail, e.g. 1-second or 1-minute resolution. It can also be useful to cache and batch the incoming stream at the same resolution, e.g. add totals to the data warehouse every second instead of updating for each individual point.
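
A PostgreSQL-flavoured sketch of that idea (all names are hypothetical, including the incoming_buffer staging table):

 -- Collapse date and time into one minute-resolution bucket and flush the
 -- incoming stream in batches rather than one write per point.
 CREATE TABLE minute_rollup (
     minute_bucket TIMESTAMP        NOT NULL,   -- date + time quantized to 1 minute
     location_id   INTEGER          NOT NULL,
     name          TEXT             NOT NULL,
     point_count   BIGINT           NOT NULL,
     stat_total    DOUBLE PRECISION NOT NULL,
     PRIMARY KEY (minute_bucket, location_id, name)
 );

 -- Periodic batch flush from a small staging buffer:
 INSERT INTO minute_rollup (minute_bucket, location_id, name, point_count, stat_total)
 SELECT date_trunc('minute', event_ts), location_id, name, COUNT(*), SUM(statistic)
 FROM incoming_buffer
 GROUP BY date_trunc('minute', event_ts), location_id, name
 ON CONFLICT (minute_bucket, location_id, name)
 DO UPDATE SET point_count = minute_rollup.point_count + EXCLUDED.point_count,
               stat_total  = minute_rollup.stat_total  + EXCLUDED.stat_total;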

How large are the name and location domains, and how likely are they to change? Is there a relationship between them? You said there can be up to 200 locations. I think that if name is a very small set and is unlikely to change, you could keep the name counts as named columns in a single record, reducing the table to 1 row per location per unit of time.
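
For example (the names donut/coffee/tea are purely illustrative, as is everything else here):

 -- If the set of names is small and stable, pivot the counts into columns,
 -- giving one row per location per time bucket.
 CREATE TABLE minute_rollup_wide (
     minute_bucket TIMESTAMP        NOT NULL,
     location_id   INTEGER          NOT NULL,
     donut_count   BIGINT           NOT NULL DEFAULT 0,
     coffee_count  BIGINT           NOT NULL DEFAULT 0,
     tea_count     BIGINT           NOT NULL DEFAULT 0,
     stat_total    DOUBLE PRECISION NOT NULL DEFAULT 0,
     PRIMARY KEY (minute_bucket, location_id)
 );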

+1

You have a lot of data, so any method will take a long time simply because of the amount you are trying to analyze. I can suggest two approaches. The first one is brute force, which you have probably already thought of:

 id | location | date | time | name | statistics
 0  | blablabl | blab | blbl | blab | blablablab
 1  | blablabl | blab | blbl | blab | blablablab
 etc.

With this you can easily query and retrieve elements, since they are all in the same table, but scanning it is slow and the table is huge.
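
In SQL terms the flat table might look like this (the table name and column types are my guesses):

 CREATE TABLE events_flat (
     id          BIGINT           PRIMARY KEY,
     location    VARCHAR(100)     NOT NULL,
     event_date  DATE             NOT NULL,
     event_time  TIME             NOT NULL,
     name        VARCHAR(100)     NOT NULL,
     statistics  DOUBLE PRECISION NOT NULL
 );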

The second is better, I think:

 Multiple tables:

 id | location
 0  | blablabl

 id | date
 0  | blab

 id | time
 0  | blab

 id | name
 0  | blab

 id | statistics
 0  | blablablab

With this, you can query (a lot) faster: get the ids first, then fetch all the necessary information with them. It also lets you prepare the data in advance: you can sort the location table by location, the time table by time, the name table alphabetically, and so on, because the id order does not matter: whether the ids come back as 1 2 3 or 1 3 2, nobody really cares, and lookups will be much faster if each table is already sorted.

So, if you use the second method: the moment you receive a data point, assign it an id and write each field into its own table:

 You receive: London 12/12/12 02:23:32 donut verygoodstatsblablabla
 You add the id to each part of this and insert them into their respective tables:
 42 | London   ==> goes with the London entries in the location table
 42 | 12/12/12 ==> goes with the 12/12/12 entries in the date table
 42 | ...

With this, if you want all the London data, the rows are side by side: you just take all the ids and fetch the other data with them. If you want all the data between 11/11/11 and 12/12/12, those rows are also side by side: you just take the ids, and so on.
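
A rough SQL sketch of this layout (names and types are guesses on my part):

 -- One narrow table per dimension, all sharing the point id.
 CREATE TABLE point_location   (id BIGINT, location   VARCHAR(100));
 CREATE TABLE point_date       (id BIGINT, event_date DATE);
 CREATE TABLE point_time       (id BIGINT, event_time TIME);
 CREATE TABLE point_name       (id BIGINT, name       VARCHAR(100));
 CREATE TABLE point_statistics (id BIGINT, statistic  DOUBLE PRECISION);

 -- "Take the ids, then get the other data with them":
 SELECT s.id, n.name, s.statistic
 FROM point_location   l
 JOIN point_date       d ON d.id = l.id
 JOIN point_name       n ON n.id = l.id
 JOIN point_statistics s ON s.id = l.id
 WHERE l.location = 'London'
   AND d.event_date BETWEEN DATE '2011-11-11' AND DATE '2012-12-12';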

Hope I helped, sorry for my poor English.

0

You should check out Apache Flume and Hadoop http://hortonworks.com/hadoop/flume/#tutorials

A Flume agent can be used to collect and aggregate the data into HDFS, and you can scale it as needed. Once the data is in HDFS there are many ways to visualize it, and you can even use MapReduce or Elasticsearch to look at the kinds of datasets in the examples above.
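
As one possible follow-on, a hedged sketch of querying the landed files with Hive-style SQL; the external table, its columns, and the HDFS path are assumptions, not something taken from the tutorial above:

 -- Map an external Hive table onto the files Flume has written to HDFS.
 CREATE EXTERNAL TABLE events_raw (
     location   STRING,
     event_date STRING,
     event_time STRING,
     name       STRING,
     statistic  DOUBLE
 )
 ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
 LOCATION '/data/events/';

 -- Batch aggregation over the raw points.
 SELECT location, event_date, name, COUNT(*) AS name_count, SUM(statistic) AS stat_total
 FROM events_raw
 GROUP BY location, event_date, name;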

0

I have worked with databases covering thousands of products and tens of thousands of stores (usually aggregated sales at the weekly level, plus receipt-level data for basket analysis, cross-selling, etc.). I would suggest you look at these:

  • Amazon Redshift: very scalable, relatively easy to get started with, cost-effective
  • Microsoft Columnstore Indexes: compress the data and have a familiar SQL interface, but quite expensive (a 1-year reserved r3.2xlarge instance on AWS is about $37,000); I have no experience with how it scales in a cluster (a minimal sketch of the index follows this list)
  • ElasticSearch: my personal favorite; very scalable, very efficient search using inverted indexes, a good aggregation framework, no license fees; it has its own query language, but simple queries are simple to express
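
For the Microsoft option, a minimal sketch, assuming a flat fact table like the hypothetical point_stats mentioned earlier in the thread:

 -- SQL Server: compresses the flat fact table column-wise so that the big
 -- GROUP BY / SUM queries scan far less data.
 CREATE CLUSTERED COLUMNSTORE INDEX cci_point_stats ON point_stats;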

In my experiments, ElasticSearch was 20-50% faster than Microsoft columnstore or clustered index tables for small to medium queries on the same hardware. To get fast response times you must have enough RAM to hold the necessary data structures in memory.

I know I am missing many other database engines and platforms, but these are the ones I am most familiar with. I have also used Apache Spark, though not for data aggregation but for training a distributed mathematical model.

0

Is there really any way to do this without brute forcing it in some way?

I am only familiar with relational databases, and I think the only real way to handle this is with a flat table, as suggested before: all your data point fields become columns in one table. I think you just need to decide how to do this and how to optimize it.

If you don't need to maintain 100% accuracy for every single record, then I think the real question is what you can afford to throw away.

I think my approach would be this (a rough SQL sketch follows the list):

  • Determine the smallest unit of time you will need and quantize the time domain to it; for example, each analysed record covers 15 minutes.
  • Collect the raw records into a raw table as they arrive, but as each quantization window closes, summarize its rows into the analysis table (one set of rows per 15-minute window).
  • Deleting old raw records can be done by a less time-sensitive procedure.
  • Location looks like a limited set, so use a lookup table to convert locations to integers.
  • Index all the columns in the summary table.
  • Run the queries.
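
Concretely, something like this (all names and types are my guesses):

 -- Lookup table: locations mapped to integers.
 CREATE TABLE location_lookup (
     location_id INTEGER      PRIMARY KEY,
     location    VARCHAR(100) NOT NULL
 );

 -- Raw records as they arrive from the stream.
 CREATE TABLE raw_points (
     location_id INTEGER          NOT NULL,
     event_ts    TIMESTAMP        NOT NULL,
     name        VARCHAR(100)     NOT NULL,
     statistic   DOUBLE PRECISION NOT NULL
 );

 -- 15-minute analysis table, filled as each window closes.
 CREATE TABLE points_15min (
     window_start TIMESTAMP        NOT NULL,
     location_id  INTEGER          NOT NULL,
     name         VARCHAR(100)     NOT NULL,
     point_count  BIGINT           NOT NULL,
     stat_total   DOUBLE PRECISION NOT NULL,
     PRIMARY KEY (window_start, location_id, name)
 );

 -- Extra index to support queries by location over a time range.
 CREATE INDEX ix_15min_location ON points_15min (location_id, window_start);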

Obviously, I am betting that quantizing the time domain in this way is acceptable. You could offer interactive drill-down by also querying the raw data over the time range in question, but that would still be slow.

Hope this helps.

Mark

0

Source: https://habr.com/ru/post/1232234/

