SQL database schema design for a 3-billion-record database

Here's a brain teaser. Can you solve it?

I am developing a product database on SQL Server 2008 R2 (not Enterprise Edition) that will store custom product configurations for more than 30,000 different products. The database will have up to 500 users at a time.

Here is the design problem ...

Each product has a collection of parts (up to 50 parts per product).
Therefore, if I have 30,000 products, and each of them can have up to 50 parts, that is 1.5 million distinct product-to-part relationships,

or as an equation… 

30,000 (products) × 50 (parts) = 1.5 million product-to-part records.

... and if ...

Each part can have up to 2,000 finish options (a finish being a paint color).

NOTE: Only one finish will be selected by the user at run time. The 2,000 finish options that I need to store are the allowed options for a specific part on a particular product.

So, if I have 1.5 million distinct product-to-part records, and each of those parts can have up to 2,000 finishes, that is 3 billion allowable part-to-finish relationships,

 or as an equation… 

1.5 million (product-to-part records) × 2,000 (finishes) = 3 billion part-to-finish records.

How can I design this database so that I can run fast, efficient queries for a specific product and get back its list of parts and all valid finishes for each part, without storing 3 billion part-to-finish records? Read time is more important than write time.

Please post your thoughts/suggestions if you have experience working with large databases.

Thanks!

+4
5 answers

Why is this even remotely difficult? If there is one problem relational databases excel at, it is exactly the one you describe: 3 tables and 2 many-to-many relationships. The "3 billion" number only appears if you materialize a naive full Cartesian join. Just follow a basic, normalized design:

:setvar dbname test
:setvar PRODUCTSCOUNT 30000
:setvar PARTSCOUNT 5000
:setvar FINISHESCOUNT 2000
:setvar PRODUCTSPARTS 50
:setvar PARTFINISHES 1

use master;
set nocount on;
go
rollback
go
:on error exit

if db_id('$(dbname)') is not null
begin
    alter database [$(dbname)] set single_user with rollback immediate;
    drop database [$(dbname)];
end
go

create database [$(dbname)]
    on (name = test_data, filename='c:\temp\test.mdf', size = 10GB)
    log on (name = test_log, filename='c:\temp\test.ldf', size = 100MB);
go
use [$(dbname)];
go

create table Products (
    Product_Id int not null identity(0,1) primary key,
    Description varchar(256));
go

create table Parts (
    Part_Id int not null identity(0,1) primary key,
    Description varchar(256));

create table Finishes (
    Finish_Id smallint not null identity(0,1) primary key,
    Description varchar(256));

create table ProductParts (
    Product_Id int not null,
    Part_Id int not null,
    constraint fk_products_parts_product foreign key (Product_Id) references Products (Product_Id),
    constraint fk_product_parts_part foreign key (Part_Id) references Parts (Part_Id),
    constraint pk_product_parts primary key (Product_Id, Part_Id));

create table PartFinishes (
    Part_Id int not null,
    Finish_Id smallint not null,
    constraint fk_part_finishes_part foreign key (Part_Id) references Parts (Part_Id),
    constraint fk_part_finishes_finish foreign key (Finish_Id) references Finishes (Finish_Id),
    constraint pk_part_finishes primary key (Part_Id, Finish_Id));
go

-- populate Products
declare @cnt int = 0, @description varchar(256);
begin transaction;
while @cnt < $(PRODUCTSCOUNT)
begin
    set @description = 'Product ' + cast(@cnt as varchar(10));
    insert into Products (Description) values (@description);
    set @cnt += 1;
    if @cnt % 1000 = 0
    begin
        commit;
        raiserror (N'Inserted %d products', 0,1, @cnt);
        begin transaction;
    end
end
commit;
raiserror (N'Done. %d products', 0,1, @cnt);
go

-- populate Parts
declare @cnt int = 0, @description varchar(256);
begin transaction;
while @cnt < $(PARTSCOUNT)
begin
    set @description = 'Part ' + cast(@cnt as varchar(10));
    insert into Parts (Description) values (@description);
    set @cnt += 1;
    if @cnt % 1000 = 0
    begin
        commit;
        raiserror (N'Inserted %d parts', 0,1, @cnt);
        begin transaction;
    end
end
commit;
raiserror (N'Done. %d parts', 0,1, @cnt);
go

-- populate Finishes
declare @cnt int = 0, @description varchar(256);
begin transaction;
while @cnt < $(FINISHESCOUNT)
begin
    set @description = 'Finish ' + cast(@cnt as varchar(10));
    insert into Finishes (Description) values (@description);
    set @cnt += 1;
    if @cnt % 1000 = 0
    begin
        commit;
        raiserror (N'Inserted %d finishes', 0,1, @cnt);
        begin transaction;
    end
end
raiserror (N'Done. %d finishes', 0,1, @cnt);
commit;
go

-- populate product parts
declare @cnt int = 0, @parts int = 0, @part int, @product int = 0;
begin transaction;
while @product < $(PRODUCTSCOUNT)
begin
    set @parts = rand() * ($(PRODUCTSPARTS)-1) + 1;
    set @part = rand() * $(PARTSCOUNT);
    while 0 < @parts
    begin
        insert into ProductParts (Product_Id, Part_Id) values (@product, @part);
        set @parts -= 1;
        set @part += rand()*10+1;
        if @part >= $(PARTSCOUNT)
            set @part = rand()*10;
        set @cnt += 1;
        if @cnt % 1000 = 0
        begin
            commit;
            raiserror (N'Inserted %d product-parts', 0,1, @cnt);
            begin transaction;
        end
    end
    set @product += 1;
end
commit;
raiserror (N'Done. %d product-parts', 0,1, @cnt);
go

-- populate part finishes
declare @cnt int = 0, @part int = 0, @finish int, @finishes int;
begin transaction;
while @part < $(PARTSCOUNT)
begin
    set @finishes = rand() * ($(PARTFINISHES)-1) + 1;
    set @finish = rand() * $(FINISHESCOUNT);
    while 0 < @finishes
    begin
        insert into PartFinishes (Part_Id, Finish_Id) values (@part, @finish);
        set @finish += rand()*10+1;
        if @finish >= $(FINISHESCOUNT)
            set @finish = rand()*10+1;
        set @finishes -= 1;
        set @cnt += 1;
        if @cnt % 1000 = 0
        begin
            commit;
            raiserror (N'Inserted %d part-finishes', 0,1, @cnt);
            begin transaction;
        end
    end
    set @part += 1;
end
commit;
raiserror (N'done. %d part-finishes', 0,1, @cnt);
go

Now, if we run a basic test against this, the results look good:

set statistics time on;
set statistics io on;

declare @product int = rand()*30000;

select *
from Products po
join ProductParts pp on po.Product_Id = pp.Product_Id
join Parts pa on pa.Part_Id = pp.Part_Id
join PartFinishes pf on pf.Part_Id = pa.Part_Id
join Finishes f on pf.Finish_id = f.Finish_Id
where po.Product_Id = @product;

The execution statistics:

(33 row(s) affected)
Table 'Finishes'. Scan count 0, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Parts'. Scan count 0, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'PartFinishes'. Scan count 33, logical reads 66, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'ProductParts'. Scan count 1, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Products'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 5 ms.

That is a 5 ms runtime for a random product, and this is nowhere near server hardware: I ran it on my laptop. There are no surprises; every lookup is covered by the clustered indexes on these tables. I will let you set up a stress test for 500 users and measure for yourself how it behaves under concurrency, but I expect it to hold up well.
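For the concurrency side, here is a minimal sketch (not part of the original answer; the procedure name is made up) of how the per-product lookup could be packaged so that 500 concurrent sessions reuse one cached plan and hit only the index seeks shown above:

create procedure dbo.GetProductConfiguration
    @Product_Id int
as
begin
    set nocount on;

    -- product -> parts -> allowed finishes, all resolved via the clustered primary keys
    select pa.Part_Id,
           pa.Description as PartDescription,
           f.Finish_Id,
           f.Description  as FinishDescription
    from ProductParts pp
    join Parts        pa on pa.Part_Id  = pp.Part_Id
    join PartFinishes pf on pf.Part_Id  = pa.Part_Id
    join Finishes     f  on f.Finish_Id = pf.Finish_Id
    where pp.Product_Id = @Product_Id;
end
go

-- example call
exec dbo.GetProductConfiguration @Product_Id = 12345;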

+6

Firstly, 3 billion is an upper limit. Most likely you will have far fewer combinations in the real world. Nonetheless...

First of all, have good indexes.
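As a hedged illustration of that point (assuming the normalized tables from the answer above): the clustered primary keys already cover the product-to-parts-to-finishes path, so extra nonclustered indexes are only worth adding for query shapes you actually run, such as reverse lookups:

-- "which parts offer finish X?" / "which products use part Y?" (only if you need those queries)
create nonclustered index ix_part_finishes_by_finish
    on PartFinishes (Finish_Id, Part_Id);

create nonclustered index ix_product_parts_by_part
    on ProductParts (Part_Id, Product_Id);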

Secondly, you need enough RAM on the server (and enough processor power) to handle the kinds of queries you will be running.

So, what will your queries be?

I assume that your products will be grouped/classified in some way.

If this is an ordering system, then queries at that level will likely return perhaps a few hundred products at a time.

After a product is selected, you load things such as the related parts for the selected product(s). Again, that returns fewer than 50 records per product. Pretty small. The amount of data for the finish types is not that large either.

Even if this is just a reference/lookup system, the amount of data touched by any one query is not that large.

So, really, what remains is just physical storage and RAM. The physical storage must be large enough to hold the data: perhaps about 1 GB or so, which is still pretty small.
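Rather than guessing, once the bridge tables are loaded you can read the actual footprint directly (a small sketch using the table names from the first answer):

-- reports rows, reserved, data and index size per table
exec sp_spaceused 'Products';
exec sp_spaceused 'ProductParts';
exec sp_spaceused 'PartFinishes';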

For RAM, you want enough that SQL Server can keep the relevant tables in memory. If the physical size estimate is roughly right, I would say a system with 8 GB is just fine, maybe a quad-core processor depending on the load. Such machines are cheap, so get two.

You mentioned 500 users, but what is the workload of those users like? Are they all active at the same time? How often do they query the server? How much data do they need at once?

Answering these questions will lead you to the actual number (and type) of queries per second the database hardware will have to support.

As a side note, your calculations are off. For example, you should not multiply the total number of finish options by the total number of products/parts. I seriously doubt any single part offers a choice of 2,000 paint colors.

A better way to calculate this is to take the MEAN number of finish options per part times the MEAN number of parts per product. That will give you a closer idea of the number of actual combinations. But it is a mostly useless data point, as the number really matters little given the likely queries anyway...
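If you do want that mean-based estimate, it can be read straight from the bridge tables of the normalized design (a sketch assuming the table names from the first answer):

select
    (select count(*) from ProductParts) as ProductPartRows,
    (select count(*) from PartFinishes) as PartFinishRows,
    -- mean parts per product
    (select avg(cnt * 1.0)
       from (select count(*) as cnt from ProductParts group by Product_Id) p) as MeanPartsPerProduct,
    -- mean allowed finishes per part
    (select avg(cnt * 1.0)
       from (select count(*) as cnt from PartFinishes group by Part_Id) f) as MeanFinishesPerPart;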

+3

Two things:

  • Index the columns that will be queried.
  • If your read:write ratio is around 80:20, use caching, e.g. Memcached.
0

An alternative is to go object-relational or nested-relational, so that the IDs of the parts that make up an assembly are stored inside the assembly record itself rather than linked to it through an intermediate table.
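As a rough sketch of that idea (hypothetical table and column names; an XML column is just one possible container for the embedded IDs):

-- part IDs embedded in the product row instead of a ProductParts bridge table
create table ProductsNested (
    Product_Id  int not null primary key,
    Description varchar(256),
    PartIds     xml null);

insert into ProductsNested (Product_Id, Description, PartIds)
values (1, 'Sample product', '<p id="17"/><p id="42"/>');

-- expand the embedded part list for one product
select n.p.value('@id', 'int') as Part_Id
from ProductsNested pr
cross apply pr.PartIds.nodes('/p') as n(p)
where pr.Product_Id = 1;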

0

The design @Remus posted will work well, but you will probably need to partition the tables on some column. That will speed up query execution, and you can easily archive data with a partition SWITCH if it sits in its own partition.

Even with indexes covering all the predicates, queries can still be slow against a very large number of rows without partitioning. You could use a product code, or something based on a naming convention that tells you whether a product is new or old, include that partition column in your queries, and they will stay fast.

Importing a lot of data into these tables is also a concern. You need to think about which indexes to rebuild after the insert and how long that takes. If you load the data into a fresh partition, performance will be acceptable.
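A hedged sketch of what that might look like (boundary values and object names are made up; note that table partitioning requires Enterprise Edition on SQL Server 2008 R2):

-- partition the bridge table by Product_Id ranges
create partition function pf_ProductRange (int)
    as range right for values (10000, 20000, 30000);

create partition scheme ps_ProductRange
    as partition pf_ProductRange all to ([PRIMARY]);

create table ProductPartsPartitioned (
    Product_Id int not null,
    Part_Id    int not null,
    constraint pk_product_parts_partitioned primary key (Product_Id, Part_Id))
on ps_ProductRange (Product_Id);

-- archiving then becomes a metadata-only operation, e.g.:
-- alter table ProductPartsPartitioned switch partition 1 to ProductPartsArchive;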

0

Source: https://habr.com/ru/post/1333600/

