I realize this question probably cannot be answered exactly, because that would require a rather deep analysis of the details of our system.
We have a distributed network of sensors. The collected information is stored in a single database and processed further.
The current database design is one huge table, split by month. We try to keep it at around 1 billion rows (usually 600-800 million records), with an insert rate of 20-50 million records per day.
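The monthly split is conceptually similar to range partitioning on the receive date in SQL Server; a minimal sketch with example boundary dates (not our exact DDL):

    -- Minimal sketch only (not our exact DDL): the monthly split expressed as
    -- SQL Server range partitioning on the server receive date.
    CREATE PARTITION FUNCTION pfMonthly (datetime)
    AS RANGE RIGHT FOR VALUES ('2010-10-01', '2010-11-01', '2010-12-01');

    CREATE PARTITION SCHEME psMonthly
    AS PARTITION pfMonthly ALL TO ([PRIMARY]);

    -- The raw-data table would then be created ON psMonthly([serverDate]).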
The database server is currently MS SQL Server 2008 R2; we started on 2005 and upgraded during the development of the project.
The table itself contains the fields SensorId, MessageTypeId, ReceiveDate and Data. The current solution is to store the sensor data in the Data field (binary, fixed length of 16 bytes), partially decode its type and store that in MessageTypeId.
The sensors send messages of different types (currently about 200), and new types can be added on demand.
The main processing is performed on an application server, which fetches records on demand (by type, SensorId and date), decodes them and runs the required processing. For the current amount of data, the speed is sufficient.
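A typical fetch looks roughly like this (illustrative only; the parameters are placeholders and the column names are from the raw table shown in Update 2):

    -- Illustrative only: the kind of on-demand lookup the application server runs
    -- (parameters are placeholders; column names are from the raw table in Update 2).
    SELECT [sourceId], [serverDate], [messageTypeId], [data]
    FROM [Messages]
    WHERE [messageTypeId] = @messageTypeId
      AND [sourceId] = @sourceId
      AND [serverDate] BETWEEN @dateFrom AND @dateTo;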
We have a request to increase the capacity of the system by a factor of 10-20, and we are worried whether our current solution can handle that.
We also have two ideas for “optimizing” the structure that I want to discuss.
1 Sensor data can be divided into types; for simplicity I will use the 2 primary ones: level data (analog data with a range of values) and state data (a fixed set of values).
So we could remake our one big table into a bunch of small ones using the following rules (a sketch of the resulting tables is below):
- for each value of a fixed (state) type we create its own table, keyed by SensorId and ReceiveDate (so we avoid storing the type and the binary blob); every dependent (extended) state is stored in its own table as well, similar to a foreign key, so if we have a state with values A and B, and dependent (additional) states 1 and 2 for it, we end up with the tables StateA_1, StateA_2, StateB_1, StateB_2. Thus a table's name encodes the fixed states it represents;
- for each kind of analog information we create a separate table, similar to the first kind but with an additional field holding the sensor value.
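To make this concrete, here is a rough sketch of the resulting tables (names and column types are illustrative only, not a worked-out design):

    -- Illustrative only: one narrow table per fixed state combination, keyed by
    -- sensor and receive date, with no type column and no binary blob.
    CREATE TABLE [StateA_1](
        [sensorId] [int] NOT NULL,
        [receiveDate] [datetime] NOT NULL
    );
    CREATE TABLE [StateA_2](
        [sensorId] [int] NOT NULL,
        [receiveDate] [datetime] NOT NULL
    );
    -- An analog (level) type gets the same layout plus a value column;
    -- [LevelBattery] is a made-up example name.
    CREATE TABLE [LevelBattery](
        [sensorId] [int] NOT NULL,
        [receiveDate] [datetime] NOT NULL,
        [value] [float] NOT NULL
    );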
Pros:
- We store only the required amount of data (at present our binary data block is padded to the longest value), so the database size shrinks;
- To get data of a certain type we simply pick the corresponding table instead of filtering by type;
Cons:
- AFAIK, it violates recommended practice;
- It requires developing a framework to automate table management, since it would be too much for a DBA to maintain by hand;
- The number of tables can become significantly larger, since full coverage of the possible values is required;
- Introducing new sensor data, or even a new state value for an already defined state, means changing the DB schema and may require complex changes;
- The complicated management makes it error-prone;
- Perhaps inserting values into such an organisation of tables is also a pain for the DB?
- The structure of the database is not fixed (it changes constantly);
Despite all the disadvantages there are some advantages, and if we get a significant gain in performance and/or (less preferable, but still valuable) storage, perhaps we will go this way.
2 Simply split the table by sensor (that would be about 100,000 tables), or better by ranges of sensors, and/or move to separate databases on dedicated servers, though we want to avoid extra hardware if possible (a partitioning sketch is given after this list).
3 Leave as is.
4 Switch to a different type of DBMS, e.g. a column-oriented DBMS (HBase, etc.).
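For the "range of sensors" variant of option 2, the split could be done with table partitioning rather than separate physical tables; a rough sketch (the boundaries are made up, and partitioning needs Enterprise Edition):

    -- Rough sketch only: range-partition the raw table by sensor id instead of
    -- creating ~100,000 physical tables (partitioning needs Enterprise Edition).
    CREATE PARTITION FUNCTION pfSensorRange (int)
    AS RANGE LEFT FOR VALUES (25000, 50000, 75000);  -- made-up boundaries

    CREATE PARTITION SCHEME psSensorRange
    AS PARTITION pfSensorRange ALL TO ([PRIMARY]);

    -- CREATE TABLE [Messages](...) ON psSensorRange([sourceId]);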
What do you think? Can you recommend any resources for further reading?
Update: The nature of the system is that some sensor data can arrive with a delay of up to a month (usually 1-2 weeks); some sensors are always online, while others have on-board memory and only connect to the Internet from time to time. Each sensor message has an associated event date and a server receive date, so we can distinguish recent data from data collected some time ago. The processing includes some statistical calculations, detection of deviations, etc. We build aggregated reports for quick viewing, but when data arrives from a sensor and updates old (already processed) data, we have to rebuild some reports from scratch, because they depend on all the available data and the aggregated values cannot be reused. So we usually keep 3 months of data for fast access and archive the rest. We try to reduce the amount of data we have to store, but we concluded that we need all of it to keep the results accurate.
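Telling fresh data from late data is just a comparison of the two dates; for illustration (the 7-day threshold is arbitrary):

    -- Illustrative only: messages that arrived more than 7 days after the event
    -- they describe, i.e. the late data that forces report rebuilds.
    SELECT [sourceId], [messageDate], [serverDate]
    FROM [Messages]
    WHERE DATEDIFF(day, [messageDate], [serverDate]) > 7;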
Update 2:
Here is the table with the raw data. As I mentioned in the comments, we removed all dependencies and constraints from it for the sake of speed, so it is used only for storage.
    CREATE TABLE [Messages](
        [id] [bigint] IDENTITY(1,1) NOT NULL,
        [sourceId] [int] NOT NULL,
        [messageDate] [datetime] NOT NULL,
        [serverDate] [datetime] NOT NULL,
        [messageTypeId] [smallint] NOT NULL,
        [data] [binary](16) NOT NULL
    )
Example data from one of the servers:
    id          sourceId  messageDate              serverDate               messageTypeId  data
    1591363304  54        2010-11-20 04:45:36.813  2010-11-20 04:45:39.813  257            0x00000000000000D2ED6F42DDA2F24100
    1588602646  195       2010-11-19 10:07:21.247  2010-11-19 10:08:05.993  258            0x02C4ADFB080000CFD6AC00FBFBFBFB4D
    1588607651  195       2010-11-19 10:09:43.150  2010-11-19 10:09:43.150  258            0x02E4AD1B280000CCD2A9001B1B1B1B77