What would be the most efficient method for storing / updating Interval-based data in SQL?

I have a database table containing roughly 700 million rows of temporal data, and it is growing exponentially.

Fields:

PK.ID, PK.TimeStamp, Value 
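
For concreteness, a minimal sketch of what this schema might look like in T-SQL; the table name and the Value data type are assumptions, since only the column names are given above:

    CREATE TABLE dbo.RawValues (
        ID          int            NOT NULL,
        [TimeStamp] datetime2(0)   NOT NULL,
        Value       decimal(18, 4) NOT NULL,  -- actual type is not stated in the question
        CONSTRAINT PK_RawValues PRIMARY KEY CLUSTERED (ID, [TimeStamp])
    );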

I also have 3 other tables that group this data by day, month, and year; they contain the sum of the values for each ID over the given period. These tables are currently refreshed at night by a SQL task. A situation has now arisen where they need to be updated on the fly, whenever the data in the base table changes. A single update can be up to 2.5 million rows at a time (not very often; usually it is around 200-500 thousand rows, arriving as frequently as every 5 minutes). Is this possible without causing massive performance losses, and what would be the best method to achieve it?
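
Purely for illustration, a rough sketch of what such a nightly refresh of the daily table might look like, recomputing only the previous day's slice (all object names here are assumptions, not the actual task):

    DECLARE @Day date = CAST(DATEADD(DAY, -1, SYSDATETIME()) AS date);

    -- Throw away yesterday's totals and recompute them from the base table.
    DELETE FROM dbo.DailyTotals WHERE [Day] = @Day;

    INSERT INTO dbo.DailyTotals (ID, [Day], TotalValue)
    SELECT ID, @Day, SUM(Value)
    FROM dbo.RawValues
    WHERE [TimeStamp] >= @Day
      AND [TimeStamp] <  DATEADD(DAY, 1, CAST(@Day AS datetime2))
    GROUP BY ID;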


NB

  • The daily, monthly, and annual tables can be changed if necessary; they exist only to speed up queries such as "Get monthly totals for these 5 identifiers for the last 5 years". Against the raw data that is about 13 million rows, versus roughly 300 rows read from the monthly table (a sketch of such a query follows after this list).

  • I have SSIS at my disposal.

  • I cannot afford to lock any tables during the process.
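
As an illustration of the query mentioned in the first note, a hedged sketch against a hypothetical monthly table (the table, column names, and example IDs are assumptions):

    -- Monthly totals for 5 identifiers over the last 5 years,
    -- read from ~300 pre-aggregated rows instead of ~13 million raw rows.
    SELECT ID, [MonthStart], TotalValue
    FROM dbo.MonthlyTotals
    WHERE ID IN (101, 102, 103, 104, 105)
      AND [MonthStart] >= DATEADD(YEAR, -5, CAST(SYSDATETIME() AS date))
    ORDER BY ID, [MonthStart];
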
3 answers

700 M records after 5 months means 8.4 B records after 5 years (assuming the data inflow does not grow). Welcome to the world of big data. It is exciting here, and we welcome new residents every day :)

I will describe three successive steps you can take. The first two are stopgaps: at some point you will have too much data and will have to move on. However, each one requires more work and/or more money, so it makes sense to take them one at a time.

Step 1: Better hardware - scaling up

Faster drives, RAID, and more RAM will take you part of the way. Scaling up, as this is called, eventually breaks down, but if your data is growing linearly rather than exponentially, it will hold you over for a while.

You can also use SQL Server replication to create a copy of your database on another server. Replication works by reading the transaction log and shipping it to the replica. You can then run the scripts that build the summary (daily, monthly, annual) tables on the secondary server, where they will not hurt the performance of your primary.
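
As a very rough sketch of the direction only (not a complete or verified setup script - a distributor must already be configured, replication agents must be added, and all names here are assumptions), transactional replication is driven by the standard system procedures:

    -- On the publisher: mark the database for publication and publish the base table.
    EXEC sp_replicationdboption @dbname = N'MyDb', @optname = N'publish', @value = N'true';

    EXEC sp_addpublication @publication = N'RawValuesPub',
                           @repl_freq  = N'continuous',
                           @status     = N'active';

    EXEC sp_addarticle @publication   = N'RawValuesPub',
                       @article       = N'RawValues',
                       @source_owner  = N'dbo',
                       @source_object = N'RawValues';

    -- Push the data to the secondary server that will build the summary tables.
    EXEC sp_addsubscription @publication       = N'RawValuesPub',
                            @subscriber        = N'ReportingServer',
                            @destination_db    = N'MyDbReplica',
                            @subscription_type = N'push';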

Step 2: OLAP

Since you have SSIS at your disposal, it is time to start thinking about multidimensional data. With good design, OLAP cubes will take you a long way. They may even be enough to handle billions of records, and you can stay with them for several years (this has been done, and it lasted about two years).

Step 3: Scaling out

Process more data by distributing it across multiple machines and processing it in parallel. Done correctly, this lets you scale almost linearly: as you accumulate more data, you add more machines to keep processing time constant.

If you have $$$, use solutions from Vertica or Greenplum (there may be other options; these are the ones I am familiar with).

If you prefer open source / build-your-own, use Hadoop: log event data to files, process it with MapReduce, and store the results in HBase or Hypertable. There are many different configurations and solutions - the entire area is still in its infancy.


Indexed views.

Indexed views let you store and index aggregated data. One of the most useful aspects is that you don't even have to reference the view directly in any of your queries: if someone queries an aggregate that the view contains, the query optimizer will retrieve the data from the view instead of scanning the underlying table.

You will pay some overhead to update the view as the data changes, but from your scenario it sounds like it would be acceptable.
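
For example, a minimal sketch of a daily-totals indexed view over the base table (object names are assumptions carried over from the question; SUM in an indexed view requires the Value column to be non-nullable, and COUNT_BIG(*) is mandatory when GROUP BY is used):

    CREATE VIEW dbo.v_DailyTotals
    WITH SCHEMABINDING
    AS
    SELECT ID,
           CAST([TimeStamp] AS date) AS [Day],
           SUM(Value)                AS TotalValue,
           COUNT_BIG(*)              AS RowCnt   -- required in an indexed view with GROUP BY
    FROM dbo.RawValues
    GROUP BY ID, CAST([TimeStamp] AS date);
    GO

    -- The unique clustered index is what actually materializes the aggregated data.
    CREATE UNIQUE CLUSTERED INDEX IX_v_DailyTotals ON dbo.v_DailyTotals (ID, [Day]);

If I remember correctly, automatic matching of indexed views by the optimizer is an Enterprise Edition feature; on other editions you query the view directly with the NOEXPAND hint.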


Why don't you create monthly tables that keep just the information you need for those months? It would be like simulating multidimensional tables. Or, if you have access to multidimensional systems (Oracle, DB2, or the like), just work with multidimensionality directly; it works very well for temporal problems like yours. At the moment I don't have enough information to give you more, but you can learn a lot about it just by searching.
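
To make the idea concrete, a hedged sketch of such a monthly table and of folding each incoming batch into it; every name here, including the staging table assumed to hold the freshly loaded batch, is an assumption:

    CREATE TABLE dbo.MonthlyTotals (
        ID           int            NOT NULL,
        [MonthStart] date           NOT NULL,
        TotalValue   decimal(18, 4) NOT NULL,
        CONSTRAINT PK_MonthlyTotals PRIMARY KEY CLUSTERED (ID, [MonthStart])
    );

    -- Fold a freshly loaded batch (staged in dbo.StagingBatch) into the monthly totals,
    -- touching only the (ID, month) combinations that actually appear in the batch.
    MERGE dbo.MonthlyTotals AS tgt
    USING (
        SELECT ID,
               CAST(DATEADD(MONTH, DATEDIFF(MONTH, 0, [TimeStamp]), 0) AS date) AS [MonthStart],
               SUM(Value) AS BatchTotal
        FROM dbo.StagingBatch
        GROUP BY ID, CAST(DATEADD(MONTH, DATEDIFF(MONTH, 0, [TimeStamp]), 0) AS date)
    ) AS src
        ON tgt.ID = src.ID AND tgt.[MonthStart] = src.[MonthStart]
    WHEN MATCHED THEN
        UPDATE SET tgt.TotalValue = tgt.TotalValue + src.BatchTotal
    WHEN NOT MATCHED THEN
        INSERT (ID, [MonthStart], TotalValue)
        VALUES (src.ID, src.[MonthStart], src.BatchTotal);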

Just an idea.

