Good (NoSQL?) database for physical measurements

We are building a measuring system that will ultimately consist of thousands of measuring stations. Over its lifetime, each station will save about 500 million measurements of 30 scalar float values each. Now we are wondering how to store this data at each station, given that we will build a web application on every station, such that:

  • we want to visualize the data on several time scales (for example, the measurements of one week, month, or year),
  • we need to compute moving averages over the data (for example, the monthly average to draw the year graph),
  • the database has to be robust against power outages,
  • we only write and read data; there are no updates or deletes.

In addition, we need another server that can display the data of, say, 1000 measuring stations. That would be ~50 TB of data across 500 billion measurements. To transfer the data from the measuring stations to that server, I figured some kind of replication at the database level would be a clean and efficient way.

Now I am wondering whether a NoSQL solution might suit this purpose better than MySQL. CouchDB, Cassandra, and possibly key-value stores like Redis look especially attractive to me. Which of them, in your opinion, best fits the data model "time series of measurements"? And what about the other concerns, such as crash safety and replication from the measuring stations to the main server?

+6
3 answers

I think CouchDB is a great database, but its ability to handle big data is questionable. CouchDB is focused on ease of development and offline replication, not necessarily on performance or scalability. CouchDB itself does not support sharding, so you will be limited by the maximum size of a single node unless you use BigCouch or invent your own sharding scheme.

As you mention, Redis is an in-memory database. It is very fast and efficient at getting data into and out of RAM. It can use the disk for persistence, but it is not particularly good at that. It is great for a bounded amount of data that changes frequently. Redis has replication, but no built-in support for partitioning, so you would be on your own here again.

You also mentioned Cassandra, which I think is much closer to your use case. Cassandra is well suited for data sets that grow indefinitely; in fact, that was its original use case. Partitioning and availability are baked in, so you do not have to worry about them. The data model is also a bit more flexible than the average key/value store, adding a second dimension of columns, and a single row can practically hold millions of columns. This allows, for example, time-series data to be "bucketed" into rows that each span a time range. Data distribution over the cluster (partitioning) happens at the row level, so operations within a row only need a single node.
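
As a rough illustration of that bucketing, here is a minimal sketch using the DataStax Python driver and modern CQL (the paragraph above describes the older wide-row view of the same idea); the keyspace, table, and column names are made up:

```python
# Sketch only: assumes a reachable Cassandra node and the DataStax
# Python driver (pip install cassandra-driver). Keyspace, table and
# column names are hypothetical.
from datetime import datetime, timezone
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE KEYSPACE IF NOT EXISTS metrics
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")

# One partition ("row") per station per day: the day bucket keeps rows
# from growing without bound, and a week/month/year query only touches
# the partitions covering that time range.
session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.raw_measurements (
        station_id text,
        day        date,
        ts         timestamp,
        readings   list<float>,
        PRIMARY KEY ((station_id, day), ts)
    )
""")

insert = session.prepare(
    "INSERT INTO metrics.raw_measurements (station_id, day, ts, readings) "
    "VALUES (?, ?, ?, ?)"
)

now = datetime.now(timezone.utc)
session.execute(insert, ("station-0042", now.date(), now, [0.0] * 30))
```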

Hadoop plugs right into Cassandra, with "native drivers" for MapReduce, Pig, and Hive, so it could potentially be used to aggregate the collected data and materialize the running averages. The best practice is to shape the data around your queries, so you will probably want to keep multiple copies of the data in "denormalized" form, one for each type of query.
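
A small sketch of the "one denormalized table per query" idea: a pre-aggregated monthly-average table for the year graph, filled client-side here rather than by a Hadoop job. It reuses the hypothetical metrics keyspace from the previous sketch:

```python
# Sketch only: a denormalized rollup table so the year graph never has
# to scan raw measurements. The aggregation is done client-side here;
# a periodic MapReduce/Pig/Hive job could materialize the same table.
from cassandra.cluster import Cluster

session = Cluster(["127.0.0.1"]).connect()

session.execute("""
    CREATE TABLE IF NOT EXISTS metrics.monthly_averages (
        station_id text,
        year       int,
        month      int,
        channel    int,     -- which of the 30 scalar values
        avg_value  float,
        PRIMARY KEY ((station_id, year), month, channel)
    )
""")

def store_monthly_average(station_id, year, month, channel, samples):
    """Write one month's average for one measurement channel."""
    session.execute(
        "INSERT INTO metrics.monthly_averages "
        "(station_id, year, month, channel, avg_value) "
        "VALUES (%s, %s, %s, %s, %s)",
        (station_id, year, month, channel, sum(samples) / len(samples)),
    )
```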

Check out this post on doing basic time series in Cassandra:

http://rubyscale.com/2011/basic-time-series-with-cassandra/

+2

For highly structured data of this nature (time series of float vectors), I tend to shy away from databases. Most database features are not very interesting here; you mostly do not care about things like atomicity or transactional semantics. The one feature you do want is tolerance to failure, and that is trivial to implement when you never have to undo a write (no updates or deletes) but only append to a file. Crash recovery is simple: open a new file with an incremented serial number in its name.

A logical format for this is plain CSV. After each measurement, call flush() on the underlying file. Getting the data replicated to a central server is a job that rsync(1) solves efficiently. You can then import the data into the analysis tool of your choice.
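
A minimal sketch of that scheme, with hypothetical paths and host names: flush plus fsync after every row, a fresh serially numbered file on each start, and rsync shelled out for replication.

```python
# Sketch only: append-only CSV logging with explicit flush/fsync and a
# new serially numbered file after every restart. Paths, station name
# and the rsync target are hypothetical.
import csv
import glob
import os
import subprocess
import time

DATA_DIR = "station-0042-data"
os.makedirs(DATA_DIR, exist_ok=True)

def open_next_file():
    """Crash recovery: never reopen an old file, just start the next one."""
    existing = sorted(glob.glob(os.path.join(DATA_DIR, "measurements-*.csv")))
    path = os.path.join(DATA_DIR, f"measurements-{len(existing):06d}.csv")
    return open(path, "a", newline="")

def append_measurement(f, writer, values):
    """Append one row (timestamp + 30 floats) and force it to disk."""
    writer.writerow([time.time()] + list(values))
    f.flush()
    os.fsync(f.fileno())

def replicate_to_server():
    """Push the whole data directory to the central server with rsync."""
    subprocess.run(
        ["rsync", "-az", DATA_DIR + "/",
         "collector.example.com:/srv/stations/station-0042/"],
        check=True,
    )

f = open_next_file()
writer = csv.writer(f)
append_measurement(f, writer, [0.0] * 30)
```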

+2

I would shy away from CSV and plain-text files. They are useful when you have low volume and want to be able to quickly inspect the data, or make small changes to it, without special tooling.

When you are talking about ~50 TB of data, that is quite a lot. If a simple trick halves it, it pays for itself in storage costs and bandwidth charges.

If the measurements are taken at regular intervals, one such trick is that instead of saving a timestamp with every measurement, you store the start time and the interval once and just save the measurement values.

I would choose a file format with a small header followed by nothing but floating-point measurements. To keep individual files from getting too large, pick a maximum file size. If you initialize a file by writing it out completely before you start using it, it will be fully allocated on disk by the time you use it. You can then mmap the file and modify the data in place. If the power fails while data is being written, a measurement either makes it to disk or it does not.
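
A rough sketch of such a format, assuming a made-up header of just start time, sampling interval, and channel count, followed by preallocated float32 samples written in place through mmap:

```python
# Sketch only: a fixed-size, fully preallocated measurement file with a
# tiny header (start time, interval, channel count) followed by raw
# float32 samples, modified in place via mmap. All names and sizes are
# hypothetical; a real file would be much larger.
import mmap
import struct
import time

HEADER = struct.Struct("<ddI")    # start time (s), interval (s), channels
SAMPLE = struct.Struct("<30f")    # one measurement: 30 float32 values
MAX_SAMPLES = 100_000             # the chosen maximum file size

def create_file(path, start, interval, channels=30):
    """Write the whole file up front so it is fully allocated on disk."""
    with open(path, "wb") as f:
        f.write(HEADER.pack(start, interval, channels))
        f.write(b"\x00" * (MAX_SAMPLES * SAMPLE.size))

def write_sample(path, index, values):
    """Store measurement number `index`; its timestamp is implicitly
    start + index * interval, so none is stored per sample."""
    with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as mm:
        offset = HEADER.size + index * SAMPLE.size
        mm[offset:offset + SAMPLE.size] = SAMPLE.pack(*values)
        mm.flush()   # the sample either makes it to disk or it doesn't

create_file("station-0042.bin", start=time.time(), interval=60.0)
write_sample("station-0042.bin", 0, [0.0] * 30)
```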

0

Source: https://habr.com/ru/post/900601/

