Database design for periodic sensor data

I am designing a PostgreSQL database that will take in data from many sensor sources. I have done a lot of research into the design, and I am looking for fresh input to help me get out of a rut.

To be clear, I'm not looking for help describing data sources or any related metadata. I'm specifically trying to figure out how best to store data (ultimately, of different types).

The basic structure of the incoming data is as follows:

  • Each data logging device has several channels.
  • For each channel, the logger reads a data value and ties it to a time-stamped record.
  • Different channels may have different data types, but a float4 will generally suffice.
  • Users should be able (via database functions) to add value types other than float4, but that concern is secondary.
  • Loggers and channels will also be added through functions.

A distinguishing characteristic of this data layout is that many channels hang off a single record carrying a timestamp and an index number.

Now, to describe the amount of data and common access patterns:

  • Data will arrive from approximately 5 loggers, each with 48 channels, every minute.
    • That works out to 345,600 samples per day and about 126 million per year, and the data needs to remain readable for at least the next 10 years.
  • In the future, more loggers and channels will be added, possibly from physically different kinds of devices, but hopefully with a similar storage representation.
  • Retrieval will include requesting the same kinds of channels across all loggers and joining across logger timestamps. For example, get channel1 from logger1 and channel4 from logger2, and do a full outer join on logger1.time = logger2.time (see the query sketch just below).
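
Roughly the kind of query I mean, written against the RecordTable/DataValueTable layout sketched further down (the logger and channel ids are made up):

    -- channel1 of logger1 and channel4 of logger2, matched on logger time
    SELECT COALESCE(a.logger_time, b.logger_time) AS t,
           a.value AS logger1_channel1,
           b.value AS logger2_channel4
    FROM
        (SELECT r.logger_time, d.value
           FROM RecordTable r
           JOIN DataValueTable d ON d.record_id = r.id
          WHERE r.logger_id = 1 AND d.channel_id = 1) a
    FULL OUTER JOIN
        (SELECT r.logger_time, d.value
           FROM RecordTable r
           JOIN DataValueTable d ON d.record_id = r.id
          WHERE r.logger_id = 2 AND d.channel_id = 4) b
        ON a.logger_time = b.logger_time
    ORDER BY t;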

I should also note that each logger timestamp is subject to change due to clock adjustment, and will be described in a separate table recording the server read time, the logger read time, the transmission latency, the clock adjustment, and the adjusted clock value. This will be done for a set of records/timestamps, depending on the retrieval. That is my motivation for RecordTable below, but otherwise it is not a big concern as long as I can reference a (logger, time, record) row somewhere and change the timestamps for the associated data.
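
For illustration only, the clock-adjustment bookkeeping I have in mind is roughly this (all names are placeholders, and record_id would point at the RecordTable sketched below):

    -- Sketch of the clock-adjustment table; columns are tentative.
    CREATE TABLE ClockAdjustment (
        record_id        integer,        -- would reference RecordTable.id (defined below)
        server_time      timestamptz,    -- when the server read the record
        logger_time      timestamptz,    -- what the logger's clock reported
        transmission_lag interval,       -- observed transmission latency
        adjustment       interval,       -- correction applied to the logger clock
        adjusted_time    timestamptz     -- logger_time + adjustment
    );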

I have looked at quite a few schema options, the simplest being something resembling a hybrid EAV approach, where the table itself effectively describes the attribute, since most attributes will simply be a real value called "value". Here is the basic layout:

    RecordTable                  DataValueTable
    -----------                  --------------
    [PK] id               <--    [FK] record_id
    [FK] logger_id               [FK] channel_id
    record_number                value
    logger_time
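
In rough DDL terms (types and constraints are my current guesses, and the logger/channel metadata tables are out of scope here), that layout is:

    CREATE TABLE RecordTable (
        id            serial PRIMARY KEY,
        logger_id     integer NOT NULL,         -- would reference a logger metadata table
        record_number integer NOT NULL,         -- the logger's own record index
        logger_time   timestamptz NOT NULL,
        UNIQUE (logger_id, record_number, logger_time)
    );

    CREATE TABLE DataValueTable (
        record_id  integer NOT NULL REFERENCES RecordTable (id),
        channel_id integer NOT NULL,             -- would reference a channel metadata table
        value      real,                         -- float4 is enough for now
        PRIMARY KEY (record_id, channel_id)
    );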

Given that (logger_id, record_number, logger_time) is unique, I suppose you could say I am using surrogate keys here, but hopefully my justification of saving disk space makes sense. I also considered adding a PK id to DataValueTable (rather than the PK being record_id and channel_id) in order to reference data values from other tables, but I am trying to resist the urge to make this model "too flexible" for now. I do, however, want to start getting data flowing soon and not have to rework this part later when extra features or differently structured data need to be added.

I first had record tables for each logger and then value tables for each channel, describing them elsewhere (in one place), with views to join them all, but that just felt "wrong" because I was repeating the same thing so many times. I suppose I am trying to find a happy medium between too many tables and too many rows, but partitioning the big data (DataValueTable) seems odd, because I would most likely partition on channel_id, so each partition would hold the same channel_id in every row. Also, partitioning in that respect would require some work in redefining the check constraints each time a channel is added. Partitioning by date applies only to RecordTable, which is not really necessary given how comparatively small it will be (7,200 rows per day with the 5 loggers).
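
To make the partitioning concern concrete, this is roughly the inheritance-style, per-channel partitioning I mean (child tables need their own constraints and indexes, plus a trigger or rules on the parent to route inserts, all repeated whenever a channel is added):

    -- One child table per channel, each with its own CHECK constraint.
    CREATE TABLE DataValueTable_ch1 (
        CHECK (channel_id = 1)
    ) INHERITS (DataValueTable);

    CREATE TABLE DataValueTable_ch2 (
        CHECK (channel_id = 2)
    ) INHERITS (DataValueTable);

    -- Indexes and primary/foreign keys are not inherited, so each child
    -- needs them recreated, and inserts must be routed by a trigger.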

I also considered using the above with partial indexes on channel_id, since DataValueTable will grow very large but the set of channel ids will remain small. I am really not sure that will scale well over many years, though; I did some basic testing with mock data and the performance was only so-so, and I want it to remain exceptional as the data grows. I am also somewhat concerned about vacuuming/analyzing such a large table, and about dealing with a large number of indexes (up to 250 in this case).
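
For reference, the partial-index idea amounts to one index per channel along these lines (channel_id = 7 is just an example), which is how the count climbs toward 250:

    CREATE INDEX datavalue_ch7_idx
        ON DataValueTable (record_id)
        WHERE channel_id = 7;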

On a tangential note, I will also be tracking changes to this data and allowing annotations (for example, a bird fouled the sensor, so these values were adjusted/flagged, etc.), so keep that in the back of your mind when considering the design here, but for now it is a separate concern.
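
Just to illustrate what I mean by annotations (this is a separate problem and every name here is made up), something like:

    -- Flag an adjusted value on a channel and keep the original reading.
    CREATE TABLE DataAnnotation (
        id         serial PRIMARY KEY,
        record_id  integer NOT NULL,     -- the affected (logger, time, record) row
        channel_id integer NOT NULL,
        note       text,                 -- e.g. 'bird fouled the sensor'
        old_value  real,                 -- the reading before adjustment, if any
        flagged_by text,
        flagged_at timestamptz DEFAULT now()
    );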

Some background on my experience/technical level, in case it helps show where I am coming from: I am a PhD student, and I work with data/databases regularly as part of my research. However, my practical experience in designing a robust database for clients (as part of a business), one with exceptional longevity and flexible data representation, is somewhat limited. I think my main problem right now is that I keep weighing every possible approach to this problem instead of focusing on implementing one, and I do not see the "right" solution in front of me at all.

So, in conclusion, these are my main questions for you: if you have done something like this, what worked for you? What are the pros/cons I am not seeing in the various designs I have proposed here? How might you design something like this, given these parameters and access patterns?

I will be happy to provide clarification/details where needed, and thanks in advance for being awesome.

2 answers

It is no problem at all to handle all of this in a relational database. PostgreSQL is not enterprise class, but it is certainly one of the better free SQL databases.

To be clear, I'm not looking for help describing data sources or any related metadata. I'm specifically trying to figure out how best to store data (ultimately, of different types).

This is your biggest obstacle. Unlike program design, which allows the analysis/design of components to be decomposed and isolated, databases need to be designed as a single whole. Normalisation and other design techniques need to consider both the whole and the components in context. The data, the descriptions, and the metadata have to be evaluated together, not as separate parts.

Second, when you start with surrogate keys, implying that you already know the data and how it relates to other data, it prevents genuine modelling of the data.

I have answered a very similar set of questions, coincidentally regarding very similar data. If you could read those answers first, it would save us both a lot of typing time on your question/answer.

Answer one / ID obstacle
Answer two / Main
Answer three / Historical


I did something similar with seismic data for oil exploration.

My suggestion would be to store the metadata in a database and keep the sensor data in flat files, whatever that means for your operating system.

You would have to write your own access routines if you want to modify the sensor data. Actually, you should never modify the sensor data; instead, make a copy of it with the modifications, so that you can later show what changes were made to the sensor data.
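
A rough sketch of how the metadata side of that could look (the table and column names are only illustrative): the database tracks which flat file holds which data, and a modification produces a new file and a new row rather than an edit in place.

    CREATE TABLE sensor_file (
        id           serial PRIMARY KEY,
        logger_id    integer NOT NULL,
        channel_id   integer NOT NULL,
        revision     integer NOT NULL DEFAULT 1,
        derived_from integer REFERENCES sensor_file (id),  -- NULL for the raw data
        file_path    text NOT NULL,      -- location of the flat file on disk
        note         text,               -- what was changed and why
        created_at   timestamptz DEFAULT now()
    );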

