Database design for storing data with a high sampling rate and graphing at several zoom levels

I have several sensors feeding data into my web application. Each channel produces 5 samples per second, and the data is uploaded in one-minute JSON messages (300 samples each). The data will be displayed using flot at several zoom levels, from 1 day down to 1 minute.

I am using Amazon SimpleDB, and I currently store the data in the one-minute chunks in which it arrives. This works well for high zoom levels, but for a full day there are simply too many rows to retrieve.

My current idea is to run a scan every hour that aggregates the last hour of data down to 300 samples and stores them in another table, essentially downsampling the data.

Does this sound like a sensible approach? How do others implement this kind of system?

+4
4 answers

I implemented something like this a while back, with downsampling "on the fly" for some graphs. The downside is that older data loses resolution, but I think that is acceptable in your case. And if you are interested in peaks, you can store the max, avg and min values.

The algorithm is not too complicated either. With 5 samples per second, if you keep that granularity for one hour, you need to store 5 * 60 * 60 = 18,000 samples for that hour.

For the rest of the day, you can go down to one sample every 5 seconds, reducing the volume by a factor of 25. The algorithm would then run every 5 seconds and calculate the avg, min and max of the 5-second window that has just aged out of the full-resolution hour. That results in 12 * 60 * 23 = 16,560 more samples for the remaining 23 hours of the day.

Further back, I would go to one sample per minute, another reduction by a factor of 12, kept for perhaps two weeks, giving you 60 * 24 * 13 = 18,720 more samples for the remaining 13 days of a two-week window.
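A minimal sketch of that windowed downsampling, assuming the raw data is available as (timestamp, value) pairs in memory; the function name and bucket size are illustrative, not from the answer:

```python
from statistics import mean

def downsample(samples, bucket_seconds):
    """Collapse (timestamp, value) pairs into (bucket_start, min, avg, max) rows.

    `samples` must be sorted by timestamp; `bucket_seconds` is e.g. 5 or 60.
    """
    buckets = {}
    for ts, value in samples:
        bucket_start = ts - (ts % bucket_seconds)
        buckets.setdefault(bucket_start, []).append(value)
    return [
        (start, min(values), mean(values), max(values))
        for start, values in sorted(buckets.items())
    ]

# Example: 5 samples/second collapsed into 5-second buckets (factor of 25).
raw = [(t / 5, 20.0 + (t % 7)) for t in range(5 * 60)]  # one minute of fake data
print(downsample(raw, 5)[:3])
```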

Particular attention should be paid to how the data is stored in the database. To get maximum performance, you want the data of one sensor to sit contiguously in the database. If you use PostgreSQL, for example, you know that one block is 8192 bytes long and that a single row is never split across two blocks. Assuming one sample is 4 bytes long, and allowing for per-row overhead, I could fit a bit under 2048 samples into one block. At maximum resolution that is 2040 / 5 / 60 ≈ 6.8, so roughly six minutes of data. It might therefore be a good idea to always insert six (or maybe five) minutes at a time, with the later minutes written as dummy rows that get updated as the data arrives, so that queries for a single sensor can read whole blocks and run faster.
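A hedged sketch of that pre-allocation idea using psycopg2; the connection string, table name, column layout and six-minute batch are assumptions for illustration, not taken from the answer:

```python
import psycopg2

conn = psycopg2.connect("dbname=sensors")  # hypothetical connection string

def preallocate_block(sensor_id, start_minute, minutes=6):
    """Insert placeholder rows for the next `minutes` minutes of one sensor,
    so its data lands close together and is later updated in place."""
    with conn, conn.cursor() as cur:
        rows = [
            (sensor_id, start_minute + m, None)   # NULL payload marks a dummy row
            for m in range(minutes)
        ]
        cur.executemany(
            "INSERT INTO samples_1min (sensor_id, minute_epoch, payload) "
            "VALUES (%s, %s, %s)",
            rows,
        )

def store_minute(sensor_id, minute_epoch, samples):
    """Overwrite the dummy row once the real 300 samples for that minute arrive."""
    with conn, conn.cursor() as cur:
        cur.execute(
            "UPDATE samples_1min SET payload = %s "
            "WHERE sensor_id = %s AND minute_epoch = %s",
            (samples, sensor_id, minute_epoch),
        )
```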

In any case, I would use separate tables for the different levels of detail of the sensor data.

+1

Storing downsampled data is a great approach. Look at how munin stores its graphs: daily, monthly, yearly and intraday graphs are kept separately there.

You can store per-minute, per-5-minute, hourly, 4-hourly and daily data in different tables. The storage overhead is very small compared to keeping only per-minute data, and the big advantage is that you never transfer data you do not need.
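A small sketch of how a query layer might pick the finest table that still keeps a chart under a point budget; the table names and budget are made up for illustration:

```python
# Finest-first list of (resolution in seconds, table name); names are hypothetical.
RESOLUTIONS = [
    (60,        "samples_1min"),
    (300,       "samples_5min"),
    (3600,      "samples_1hour"),
    (4 * 3600,  "samples_4hour"),
    (24 * 3600, "samples_1day"),
]

def pick_table(span_seconds, max_points=2000):
    """Return the finest-grained table that keeps the chart under max_points."""
    for resolution, table in RESOLUTIONS:
        if span_seconds / resolution <= max_points:
            return table
    return RESOLUTIONS[-1][1]  # very long spans fall back to the daily table

print(pick_table(3600))        # one hour  -> samples_1min (60 points)
print(pick_table(14 * 86400))  # two weeks -> samples_1hour (336 points)
```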

+5

To beat the database on speed, use the direct file organization model. It is the fastest way to store and retrieve data in files, and the implementation is so simple that you do not need any framework or library.

Method:

  • you need an algorithm that converts the key into a contiguous record number (0..max number of records),
  • you need to use a fixed record size,
  • data is stored in flat files, where a record's position in the file is its record number (derived from the key, as in point 1) multiplied by the record size (point 2); see the sketch below.
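A minimal sketch of such a direct-organization store in Python, assuming fixed-size records holding a single float packed with struct; the class name and record layout are illustrative:

```python
import os
import struct

RECORD_FMT = "<f"                            # one 4-byte float per sample (assumed)
RECORD_SIZE = struct.calcsize(RECORD_FMT)

class DirectFile:
    """Flat file addressed by record number: offset = rec_no * RECORD_SIZE."""

    def __init__(self, path, max_records):
        # Pre-create the file filled with zeros so every offset is valid.
        if not os.path.exists(path):
            with open(path, "wb") as f:
                f.write(b"\x00" * RECORD_SIZE * max_records)
        self.f = open(path, "r+b")

    def write(self, rec_no, value):
        self.f.seek(rec_no * RECORD_SIZE)
        self.f.write(struct.pack(RECORD_FMT, value))

    def read(self, rec_no, count=1):
        self.f.seek(rec_no * RECORD_SIZE)
        data = self.f.read(count * RECORD_SIZE)
        return [v for (v,) in struct.iter_unpack(RECORD_FMT, data)]
```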

Native data

You can create one data file per day for easy maintenance. The key is then the sample number within the day, so a daily file will be 18000 * 24 * record-size bytes. You should pre-create this file filled with zeros to make the operating system's life easier (it may help a little; that depends on the underlying file system and caching mechanism).

When data arrives, calculate its position in the file and write the record in place.
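Building on the DirectFile sketch above, a possible mapping from timestamp to record number within the day; the file name and sample date are placeholders:

```python
import datetime as dt

SAMPLES_PER_DAY = 5 * 60 * 60 * 24        # 432,000 records at 5 Hz

def record_number(ts: dt.datetime, rate_hz: int = 5) -> int:
    """Map a timestamp to its sample slot within the day (0..SAMPLES_PER_DAY-1)."""
    midnight = ts.replace(hour=0, minute=0, second=0, microsecond=0)
    return int((ts - midnight).total_seconds() * rate_hz)

# One zero-filled file per sensor per day (names are hypothetical).
day_file = DirectFile("sensor42_2024-01-01.dat", SAMPLES_PER_DAY)
now = dt.datetime(2024, 1, 1, 12, 30, 0)
day_file.write(record_number(now), 21.7)
print(day_file.read(record_number(now)))
```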

Summarized data

You should store the summarized data in direct files as well. These files will be much smaller: for 1-minute summary values, a daily file has 24 * 60 = 1,440 records.
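One possible fixed layout for a summary record, assuming each slot stores min, avg and max as floats; the format is an assumption, not from the answer:

```python
import struct

SUMMARY_FMT = "<fff"                          # min, avg, max for one minute (assumed)
SUMMARY_SIZE = struct.calcsize(SUMMARY_FMT)   # 12 bytes per record
MINUTES_PER_DAY = 24 * 60                     # 1,440 summary slots per day

def pack_summary(minimum, average, maximum):
    return struct.pack(SUMMARY_FMT, minimum, average, maximum)

def unpack_summary(raw):
    return struct.unpack(SUMMARY_FMT, raw)

# A daily summary file is therefore MINUTES_PER_DAY * SUMMARY_SIZE = 17,280 bytes.
```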

There are several decisions you must make:

  • the zoom steps,
  • which zoom steps get their own summarized data (you do not need summary data for every zoom step),
  • how the summary files are organized (native data can go into daily files, but daily summaries are better kept in monthly files).

Another thing to think about is when to create the aggregated data. While native data should be stored as soon as it arrives, aggregated data can be calculated at any time:

  • on arrival of the raw data (in this case a one-minute summary would be updated 300 times, which is not optimal if each update goes straight to disk; the summation should be done in memory, as sketched below),
  • by a background job that periodically processes the native data,
  • lazily, on demand when a summary is first requested.
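A small sketch of the in-memory accumulation mentioned in the first option; the class and the flush callback are illustrative:

```python
class MinuteAccumulator:
    """Accumulate raw samples in memory and emit one (min, avg, max) row per minute."""

    def __init__(self, flush):
        self.flush = flush          # callback: flush(minute_epoch, mn, avg, mx)
        self.minute = None
        self.values = []

    def add(self, ts, value):
        minute = int(ts) // 60
        if self.minute is not None and minute != self.minute:
            self._emit()
        self.minute = minute
        self.values.append(value)

    def _emit(self):
        vals = self.values
        self.flush(self.minute * 60, min(vals), sum(vals) / len(vals), max(vals))
        self.values = []

# Usage: 300 samples are summed in memory, only one record hits the disk per minute.
acc = MinuteAccumulator(flush=lambda t, mn, avg, mx: print(t, mn, avg, mx))
for i in range(600):                # two minutes of 5 Hz data
    acc.add(i / 5, 20 + (i % 7))
```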

Remember that, not so many years ago, these techniques were how databases themselves were designed. I can promise one thing: it will be fast, faster than anything else (other than keeping the data in memory).

+2

Amazon CloudWatch has supported custom metrics for a few days now. If monitoring and alerting are your primary concern, this could be helpful.
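For reference, a sketch of pushing a per-minute sensor summary as a custom metric with the boto3 CloudWatch client; the namespace, metric name and dimension are placeholders:

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

def push_minute_summary(sensor_id, minimum, average, maximum, count, timestamp=None):
    """Publish one minute of sensor data as aggregated statistic values."""
    cloudwatch.put_metric_data(
        Namespace="SensorApp",                                # placeholder namespace
        MetricData=[{
            "MetricName": "SensorReading",
            "Dimensions": [{"Name": "SensorId", "Value": str(sensor_id)}],
            "Timestamp": timestamp or datetime.datetime.utcnow(),
            "StatisticValues": {
                "SampleCount": count,
                "Sum": average * count,
                "Minimum": minimum,
                "Maximum": maximum,
            },
        }],
    )
```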

0

Source: https://habr.com/ru/post/1335758/

