Huge dataset database design recommendations

I suspect this question cannot be answered exactly, because it would require a rather involved analysis and deep familiarity with the details of our system.

We have a distributed network of sensors. The information is collected in one database and then processed.

The current database design is one huge table, split per month. We try to keep it at around 1 billion rows (usually 600-800 million records), with a fill rate of 20-50 million records per day.

The database server is currently MS SQL 2008 R2, but we started on MS SQL 2005 and have upgraded over the course of the project.

The table itself contains the fields SensorId, MessageTypeId, ReceiveDate, and Data. The current solution is to store the raw sensor payload in the Data field (binary, fixed length of 16 bytes), partially decode its type, and store that in MessageTypeId.

Sensors send messages of different types (currently about 200), and that number can grow on demand.

The main processing is performed on the application server, which retrieves records on demand (by type, SensorId, and date), decodes them, and performs the required processing. At the current data volume, the speed is sufficient.
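
For context, a typical on-demand retrieval against the current table looks roughly like the sketch below (the index and the exact names are my illustration, not something taken from the real system):

 -- Hypothetical supporting index for lookups by type, sensor and date.
 CREATE NONCLUSTERED INDEX IX_Messages_Type_Sensor_Date
     ON [Messages] ([MessageTypeId], [SensorId], [ReceiveDate])
     INCLUDE ([Data]);

 -- Typical retrieval; the application server then decodes the Data blobs.
 SELECT [SensorId], [ReceiveDate], [Data]
 FROM [Messages]
 WHERE [MessageTypeId] = 258
   AND [SensorId] = 195
   AND [ReceiveDate] >= '2010-11-01' AND [ReceiveDate] < '2010-12-01';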

We have been asked to increase the capacity of the system by 10-20 times, and we are worried whether our current solution can handle that.

We also have two ideas for “optimizing” the structure that I want to discuss.

1. Sensor data can be divided into types; for simplicity I will use two primary ones: level data (analog data with a range of values) and state data (a fixed set of values).

So, we can remake our table into a bunch of small ones using the following rules:

  • for each value of a fixed (state) type we create its own table containing only SensorId and ReceiveDate (so we avoid storing the type and the binary blob); all dependent (extended) states get their own tables, in a Foreign-Key-like fashion, so if we have a state with values A and B, and additional sub-states 1 and 2 for it, we end up with the tables StateA_1, StateA_2, StateB_1, StateB_2. Thus the table name encodes the fixed states it represents;

  • for each kind of analog data we create a separate table; it is similar to the first kind, but has an additional field holding the sensor value (a rough sketch of both kinds of tables follows below).
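
A rough sketch of what such generated tables might look like (all names and the value type are purely illustrative):

 -- Fixed-state table: a row means "this sensor reported state A/1 at this time".
 CREATE TABLE [StateA_1](
     [SensorId] [int] NOT NULL,
     [ReceiveDate] [datetime] NOT NULL
 );

 -- Analog-data table: the same keys plus the decoded measurement value.
 CREATE TABLE [LevelValueX](
     [SensorId] [int] NOT NULL,
     [ReceiveDate] [datetime] NOT NULL,
     [Value] [float] NOT NULL
 );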

Pros:

  • we store only the required amount of data (at present, our binary data block is padded to the length of the longest value), which reduces the database size;
  • to get data of a certain type, we read a dedicated table instead of filtering the big table by type;

Cons:

  • AFAIK, it violates recommended practice;
  • it requires building a framework to automate table management, since it would be too much for a DBA to maintain by hand;
  • the number of tables can become significantly larger, since full coverage of the possible values is required;
  • introducing new sensor data, or even a new state value for an already defined state, may require complex DB schema changes;
  • the complicated management is error-prone;
  • perhaps it is simply damn hard for the DB to insert values into such a table organization?
  • the database structure is not fixed (it changes constantly).

The disadvantages probably outweigh the few advantages, but if it gives us a significant gain in performance and/or (less preferable, but still valuable) storage, we might follow this path anyway.

2. Perhaps just split the table by sensor (that would be about 100,000 tables), or better by ranges of sensors, and/or move to separate databases on dedicated servers; however, we would like to avoid adding hardware if possible.
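
For what it is worth, the "range of sensors" variant does not have to mean 100,000 physical tables; a partitioned view over a handful of range tables is one way to sketch it (all names and ranges below are invented for illustration):

 -- Member tables, one per sensor range; the CHECK constraint on sourceId lets the
 -- optimizer skip tables that cannot contain the requested sensors (ranges invented).
 CREATE TABLE [Messages_S0](
     [id] [bigint] NOT NULL,
     [sourceId] [int] NOT NULL CHECK ([sourceId] BETWEEN 0 AND 49999),
     [messageDate] [datetime] NOT NULL,
     [serverDate] [datetime] NOT NULL,
     [messageTypeId] [smallint] NOT NULL,
     [data] [binary](16) NOT NULL,
     PRIMARY KEY ([sourceId], [id])
 );
 CREATE TABLE [Messages_S1](
     [id] [bigint] NOT NULL,
     [sourceId] [int] NOT NULL CHECK ([sourceId] BETWEEN 50000 AND 99999),
     [messageDate] [datetime] NOT NULL,
     [serverDate] [datetime] NOT NULL,
     [messageTypeId] [smallint] NOT NULL,
     [data] [binary](16) NOT NULL,
     PRIMARY KEY ([sourceId], [id])
 );
 GO
 -- Queries go through the view; member tables could even live in separate databases
 -- (a distributed partitioned view) if we ever do split across servers.
 CREATE VIEW [MessagesBySensorRange] AS
     SELECT * FROM [Messages_S0]
     UNION ALL
     SELECT * FROM [Messages_S1];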

3. Leave everything as is.

4. Switch to a different type of DBMS, e.g. a column-oriented DBMS (HBase and the like).

What do you think? Maybe you can suggest some resources for further reading?

Update: The nature of the system is that data from some sensors can arrive with up to a month of delay (usually a delay of 1-2 weeks), some sensors are always online, and some have on-board memory and only connect to the Internet eventually. Each sensor message has an associated event date and a server receive date, so we can distinguish recent data from data collected some time ago. Processing includes statistical calculations, deviation detection, etc. We build aggregated reports for quick viewing, but when incoming sensor data updates old (already processed) data, we have to rebuild some reports from scratch, since they depend on all available data and the aggregated values cannot be reused. So we usually keep 3 months of data for quick access and archive the rest. We have tried to reduce the amount of data we store, but we concluded that we need all of it to keep the results accurate.

Update2:

Here is the raw-data table. As I mentioned in the comments, we have stripped all dependencies and constraints from it in the name of speed, so it is used purely for storage.

 CREATE TABLE [Messages](
     [id] [bigint] IDENTITY(1,1) NOT NULL,
     [sourceId] [int] NOT NULL,
     [messageDate] [datetime] NOT NULL,
     [serverDate] [datetime] NOT NULL,
     [messageTypeId] [smallint] NOT NULL,
     [data] [binary](16) NOT NULL
 )

Example data from one of the servers:

 id         sourceId messageDate             serverDate              messageTypeId data
 1591363304 54       2010-11-20 04:45:36.813 2010-11-20 04:45:39.813 257           0x00000000000000D2ED6F42DDA2F24100
 1588602646 195      2010-11-19 10:07:21.247 2010-11-19 10:08:05.993 258           0x02C4ADFB080000CFD6AC00FBFBFBFB4D
 1588607651 195      2010-11-19 10:09:43.150 2010-11-19 10:09:43.150 258           0x02E4AD1B280000CCD2A9001B1B1B1B77

4 answers

Just going to drop some ideas here, hope they are useful - these are some of the things I would think about / research.

Partitioning - you mention that the table is split by month. Is it split manually, or are you using the partitioning functionality available in Enterprise Edition? If manual, consider using the built-in partitioning features to break your data up further, which should give increased scalability/performance. The "Partitioned Tables and Indexes" article by Kimberly Tripp on MSDN is great - a lot of good information there, and I won't do it a disservice by paraphrasing it! It is worth considering this over manually creating one table per sensor, which can be more difficult to maintain/implement and hence adds complexity (simple = good). Of course, this applies only if you have Enterprise Edition.
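
As a very rough illustration of what built-in partitioning by month could look like (boundary dates and object names are invented for the example):

 -- Monthly partition function and scheme (boundaries are just examples).
 CREATE PARTITION FUNCTION pfMessagesByMonth (datetime)
     AS RANGE RIGHT FOR VALUES ('2010-10-01', '2010-11-01', '2010-12-01');

 CREATE PARTITION SCHEME psMessagesByMonth
     AS PARTITION pfMessagesByMonth ALL TO ([PRIMARY]);

 -- The Messages table (or a new copy of it) is then created on the scheme,
 -- partitioned by the server receive date.
 CREATE TABLE [MessagesPartitioned](
     [id] [bigint] IDENTITY(1,1) NOT NULL,
     [sourceId] [int] NOT NULL,
     [messageDate] [datetime] NOT NULL,
     [serverDate] [datetime] NOT NULL,
     [messageTypeId] [smallint] NOT NULL,
     [data] [binary](16) NOT NULL
 ) ON psMessagesByMonth ([serverDate]);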

Filtered indexes - see this MSDN article.
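
A filtered index here might be something along the lines of the sketch below (message type 258 is just taken from the sample data above; in reality you would pick whichever types are queried most):

 -- Hypothetical filtered index covering only one frequently queried message type.
 CREATE NONCLUSTERED INDEX IX_Messages_Type258
     ON [Messages] ([sourceId], [serverDate])
     INCLUDE ([data])
     WHERE [messageTypeId] = 258;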

There is, of course, the hardware element - it goes without saying that a beefy server with heaps of RAM, fast disks, etc. will play a part.


One non-database approach is to switch to recording only changes in values, with a minimum number of records per minute or so. So, for example, if sensor #1 sends something like:

 Id  Date                 Value
 -----------------------------
 1   2010-10-12 11:15:00  100
 1   2010-10-12 11:15:02  100
 1   2010-10-12 11:15:03  100
 1   2010-10-12 11:15:04  105

then only the first and the last record would end up in the database. To confirm that the sensor is alive, a minimum of 3 records per minute would still be inserted. This way the amount of data would be reduced.
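
A minimal sketch of how such change-only insertion could be done on the database side, assuming a hypothetical SensorReadings table shaped like the example above (in practice the check might live in the application instead):

 -- Hypothetical variables standing in for an application-supplied reading.
 DECLARE @sensorId int = 1,
         @readingDate datetime = '2010-10-12T11:15:04',
         @value int = 105;

 -- Insert the new reading only if it differs from this sensor's last stored value;
 -- the "keep-alive" records (a few per minute) would bypass this check.
 INSERT INTO [SensorReadings] ([Id], [Date], [Value])
 SELECT @sensorId, @readingDate, @value
 WHERE NOT EXISTS (
     SELECT 1
     FROM [SensorReadings] r
     WHERE r.[Id] = @sensorId
       AND r.[Value] = @value
       AND r.[Date] = (SELECT MAX([Date]) FROM [SensorReadings] WHERE [Id] = @sensorId)
 );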

Not sure if this will help, or if it will be possible in your application - just an idea.

EDIT

Can you archive data based on access probability? Would it be correct to say that old data is less likely to be accessed than new data? If so, you may want to look at Bill Inmon's DW 2.0 architecture for next-generation data warehousing, where he discusses a model for moving data across different DW zones (Interactive, Integrated, Near line, Archival) based on access probability. Access times range from very fast (Interactive zone) to very slow (Archival). Each zone has different hardware requirements. The goal is to prevent large amounts of data from clogging up the DW.
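
In practical terms for this system, the "move by access probability" idea could start as something as simple as the sketch below (the Messages_Archive table and the 90-day cut-off are assumptions of mine):

 -- Move rows older than ~3 months from the fast-access table to an archive table.
 -- Messages_Archive is a hypothetical table with the same structure as Messages.
 INSERT INTO [Messages_Archive]
        ([sourceId], [messageDate], [serverDate], [messageTypeId], [data])
 SELECT [sourceId], [messageDate], [serverDate], [messageTypeId], [data]
 FROM [Messages]
 WHERE [serverDate] < DATEADD(day, -90, GETDATE());

 DELETE FROM [Messages]
 WHERE [serverDate] < DATEADD(day, -90, GETDATE());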


Storage-wise, you will probably be fine. SQL Server will handle it.

What worries me is the load on your server. If the transactions arrive continuously, you are at roughly 400 transactions per second today. Increase that by a factor of 20 and you are looking at ~8,000 transactions per second. That is not a small number, considering you are doing reporting on the same data...

By the way, did I understand correctly that you discard the sensor data once it has been processed? So your total data set will be a "rolling" 1 billion rows? Or do you just keep appending data?


You could store the datetime stamps as integers. I believe a datetime uses 8 bytes, while an integer uses only 4 in SQL. You would have to drop the year, but since you partition by month, that may not be a problem.

So '12/25/2010 23:22:59' would be stored as 1225232259 (MMDDHHMMSS).
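
For illustration, the conversion could be expressed in T-SQL roughly like this (just a sketch; it silently drops the year, as described):

 -- Pack a datetime into a 4-byte int as MMDDHHMMSS (the year is dropped).
 -- The worst case, 1231235959, still fits in a signed 32-bit int.
 DECLARE @d datetime = '2010-12-25T23:22:59';

 SELECT MONTH(@d)            * 100000000
      + DAY(@d)              * 1000000
      + DATEPART(hour,   @d) * 10000
      + DATEPART(minute, @d) * 100
      + DATEPART(second, @d)           AS packedStamp;   -- 1225232259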

Just a thought ...


Source: https://habr.com/ru/post/1333015/

