Bulk insert strategy from C# to SQL Server

In our current project, clients will send a collection of complex / nested messages to our system. The frequency of these messages is approximately 1000-2000 messages per second.

These complex objects contain transactional data (which is always inserted) as well as master data (which is inserted only if it is not already present). Instead of passing the identifiers of the master data, the client passes the "name" column.

The system looks up the master data by these names. If found, it uses the identifiers from the database; otherwise it creates the master data first and then uses the newly generated identifiers.

After resolving the master data identifiers, the system inserts the transactional data into the SQL Server database (using those master data identifiers). The number of master entities per message is about 15-20.

Below are some strategies that we can adopt.

  • We can resolve the master identifiers first from our C# code (inserting the master data if it is not found) and cache those identifiers in C#. Once all identifiers are resolved, we can bulk-insert the transactional data using the SqlBulkCopy class. We would hit the database 15 times to fetch the identifiers for the different entities, and then hit the database once more to insert the final data. We can use the same connection and close it after all of this processing is done. (A rough sketch of this approach appears after this list.)

  • We can send all of these messages, containing both the master data and the transactional data, to the database in a single hit (in the form of multiple TVPs), and then, inside the stored procedure, first create the missing master data and then insert the transactional data.
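For illustration, here is a minimal sketch of the first strategy, using the System.Data.SqlClient provider and simplified, hypothetical table and column names (dbo.Products, dbo.ProductPrices, ProductName, ProductID); the real objects have many more properties:

    using System.Collections.Concurrent;
    using System.Data;
    using System.Data.SqlClient;

    public static class MasterDataResolver
    {
        // Process-wide cache of Name -> ID, filled lazily.
        static readonly ConcurrentDictionary<string, int> ProductIds =
            new ConcurrentDictionary<string, int>();

        // Looks up a master-data ID by name, inserting the row if it is missing.
        // Note: the SELECT-then-INSERT is not concurrency-safe by itself; a real
        // implementation would need locking or MERGE-style handling.
        public static int ResolveProductId(SqlConnection conn, string productName)
        {
            return ProductIds.GetOrAdd(productName, name =>
            {
                using (var find = new SqlCommand(
                    "SELECT ProductID FROM dbo.Products WHERE ProductName = @Name;", conn))
                {
                    find.Parameters.AddWithValue("@Name", name);
                    object id = find.ExecuteScalar();
                    if (id != null) return (int)id;
                }
                using (var insert = new SqlCommand(
                    "INSERT INTO dbo.Products (ProductName) OUTPUT INSERTED.ProductID VALUES (@Name);", conn))
                {
                    insert.Parameters.AddWithValue("@Name", name);
                    return (int)insert.ExecuteScalar();
                }
            });
        }

        // Once every row in priceRows carries resolved IDs, push them in one bulk operation.
        public static void BulkInsertPrices(SqlConnection conn, DataTable priceRows)
        {
            using (var bulk = new SqlBulkCopy(conn) { DestinationTableName = "dbo.ProductPrices" })
            {
                bulk.WriteToServer(priceRows);
            }
        }
    }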

Can anyone suggest a better approach in this case?

Due to privacy constraints, I cannot share the actual structure of the object. But here is a hypothetical structure, which is very close to our business object.

One of these messages will contain all the information about one product (its master data) and its price data (transactional data) from different suppliers:

Master data (which must be added if not found)

Product Name: ABC, ProductCategory: XYZ, Manufacturer: XXX and some other details (the number of properties is in the range of 15-20).

Transaction data (which will always be added)

Supplier Name: A, ListPrice: XXX, Discount: XXX

Supplier Name: B, ListPrice: XXX, Discount: XXX

Supplier Name: C, ListPrice: XXX, Discount: XXX

Supplier Name: D, ListPrice: XXX, Discount: XXX

Most of the master data will remain unchanged for messages belonging to one product (and will change less frequently), but the transactional data will always fluctuate. So the system will check whether product "XXX" already exists in the system or not. If not, it checks whether the "Category" mentioned with this product exists. If not, it will insert a new record for the category and then for the product. The same is done for the manufacturer and the other master data.

Several suppliers will send data on several products (2000-5000) at the same time.

So, suppose we have 1000 suppliers, and each supplier sends data about 10-15 different products. Every 2-3 seconds, each supplier sends us a price update for those 10 products. A supplier may start sending data about new products, but that will not happen very often.

2 answers

You will most likely be better off with idea #2 (i.e. sending all 15-20 entities to the database in one shot via multiple TVPs and processing a whole set of up to 2000 messages).

Caching the master data lookups at the application layer and translating names to IDs before sending the data to the database sounds great, but it misses a few things:

  • You will have to hit the database to get the initial list anyway
  • You will have to hit the database to insert new records anyway
  • Looking up values in a dictionary to replace them with identifiers is exactly what the database does (assuming a non-clustered index on each of these name-lookup fields)
  • Frequently queried values will have their data pages cached in the buffer pool (which is an in-memory cache)

Why duplicate at the application layer what is already provided, and already happening, at the database layer? Especially given that:

  • The 15-20 entities can have up to 20k records in total (a relatively small number, especially considering that the non-clustered index only needs two fields, Name and ID, which can pack many rows onto a single data page when using a 100% fill factor).
  • Not all of the 20k entries are "active" or "current", so you do not need to worry about caching all of them. Whatever values are current will be the ones being looked up, and those data pages (which may include some inactive entries, but not many) will be the ones cached in the buffer pool.

Hence, you do not need to worry about aging out old records or forcing keys to expire or be reloaded because of possibly changing values (i.e. an updated Name for a particular ID), since that is handled naturally.

Yes, in-memory caching is wonderful technology and greatly speeds up websites, but those scenarios / use cases are for non-database processes requesting the same data over and over in a purely read-only manner. This particular scenario is one in which data is being merged and the list of lookup values can change frequently (more so due to new records than to updated ones).


With all of that in mind, Option #2 is the way to go. I have done this technique several times with great success, though not with 15 TVPs. Some optimizations / adjustments might be needed to tune it to this particular situation, but what I have found to work well is:

  • Accept the data via TVPs. I prefer this over SqlBulkCopy because:
    • it makes for an easily self-contained stored procedure
    • it fits very nicely into the application code: you can stream the collection to the database without having to copy it into a DataTable first, which duplicates the collection and wastes CPU and memory. This requires creating, for each collection, a method that returns IEnumerable<SqlDataRecord>, accepts the collection as input, and uses yield return; to send each record inside a for or foreach loop (a rough C# sketch follows this list).
  • TVPs do not carry good statistics and hence are not great to JOIN to (though this can be mitigated by using TOP (@RecordCount) in the queries), but you do not need to worry about that here, since they are only used to populate the real tables with any missing values
  • Step 1: INSERT the missing Names for each entity. Remember that each entity should have a non-clustered index on the [Name] field, and assuming the ID is the clustered index, its value will naturally be part of that index, so [Name] alone provides a covering index, which also helps the next operation. Also remember that any prior executions for this client (i.e. roughly the same entity values) will have left the data pages for these indexes cached in the buffer pool (i.e. in memory).

      ;WITH cte AS
      (
          SELECT DISTINCT tmp.[Name]
          FROM   @EntityNumeroUno tmp
      )
      INSERT INTO EntityNumeroUno ([Name])
          SELECT cte.[Name]
          FROM   cte
          WHERE  NOT EXISTS (SELECT *
                             FROM   EntityNumeroUno tab
                             WHERE  tab.[Name] = cte.[Name]);
  • Step 2: INSERT all of the "messages" via a simple INSERT...SELECT, where the data pages for the lookup tables (i.e. the "entities") are already cached in the buffer pool thanks to Step 1
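For reference, here is a minimal sketch of the streaming approach. The procedure name dbo.ImportMessages, the user-defined table type dbo.EntityNameList, and the Name column length are hypothetical; the parameter name matches the @EntityNumeroUno TVP used above:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;   // SqlDataRecord, SqlMetaData

    public static class TvpStreaming
    {
        // Streams a collection of names as TVP rows, one record at a time,
        // without first copying the collection into a DataTable.
        public static IEnumerable<SqlDataRecord> ToNameRecords(IEnumerable<string> names)
        {
            var record = new SqlDataRecord(new SqlMetaData("Name", SqlDbType.NVarChar, 100));
            foreach (string name in names)
            {
                record.SetString(0, name);
                yield return record;   // each record is sent as it is produced
            }
        }

        public static void Import(SqlConnection conn, IEnumerable<string> names)
        {
            using (var cmd = new SqlCommand("dbo.ImportMessages", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                // Caution: an empty enumeration cannot be sent as a TVP;
                // pass NULL / omit the parameter in that case.
                SqlParameter tvp = cmd.Parameters.AddWithValue("@EntityNumeroUno", ToNameRecords(names));
                tvp.SqlDbType = SqlDbType.Structured;
                tvp.TypeName  = "dbo.EntityNameList";
                cmd.ExecuteNonQuery();
            }
        }
    }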


Finally, keep in mind that hypotheses / assumptions / educated guesses are no substitute for testing. You need to try a few methods to see what works best for your particular situation, since there may be additional details that have not been shared that could affect what is considered "ideal" here.

I will say that if the operation is insert-only, then Vlad's idea might be faster. The method I am describing here I have used in situations that were more complex and required full synchronization (updates and deletes), as well as additional validations and the creation of related operational data (not lookup values). Using SqlBulkCopy might be faster for straight inserts (though for only 2000 records I doubt there is much of a difference, if any), but that assumes you are loading directly into the destination tables (messages and lookups) rather than into intermediate / staging tables (and I believe Vlad's idea is SqlBulkCopy directly into the destination tables). However, as stated above, using an external cache (i.e. not the buffer pool) is also more error prone due to the issue of lookup values being updated. It can take more code than it is worth to handle invalidating that external cache, especially if using it is only slightly faster. That additional risk / maintenance has to be weighed against which method, on the whole, is better suited to your needs.


UPDATE

Based on the information provided in the comments, we now know:

  • There are several suppliers
  • There are several products offered by each supplier.
  • Products are not unique to the Supplier; Products are sold by 1 or more suppliers.
  • Product properties are singular (one value per property per product)
  • Price information has properties that can have multiple entries
  • Price information is INSERT-only (i.e. a point-in-time history)
  • A unique Product is defined by its SKU (or a similar field)
  • Once a Product has been created, an incoming Product with an existing SKU but different properties (e.g. Category, Manufacturer, etc.) will be considered the same Product; the differences will be ignored

With all of this in mind, I still recommend TVPs, but I would rethink the approach and make it vendor-centric rather than product-centric. The assumption here is that vendors send files each time, so when you get a file, import it. The only lookup you would do ahead of time is the Vendor. Here is the basic layout (a rough C# sketch of SendRows and the import loop follows this list):

  • It seems reasonable to assume that you already have the VendorID at this point; after all, why would the system import a file from an unknown source?
  • You can import in batches
  • Create a SendRows method that:
    • accepts a FileStream or whatever allows advancing through the file
    • accepts something like an int BatchSize
    • returns IEnumerable<SqlDataRecord>
    • creates a SqlDataRecord matching the TVP structure
    • loops over the FileStream until either BatchSize records have been read or there are no more records left
    • performs any necessary validation on the data
    • maps the data onto the SqlDataRecord
    • calls yield return;
  • Open the file
  • While there is data in the file:
    • call the stored procedure
    • pass in the VendorID
    • pass in SendRows(FileStream, BatchSize) for the TVP
  • Close the file
  • Experiment with:
    • opening the SqlConnection before the loop over the FileStream and closing it after the loop
    • opening the SqlConnection, executing the stored procedure, and closing the SqlConnection inside each iteration of the FileStream loop
  • Experiment with different BatchSize values. Start with 100, then 200, 500, etc.
  • The stored procedure will handle inserting new Products
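A minimal sketch of that layout, assuming a simple CSV-style file read through a StreamReader and hypothetical names (procedure dbo.ImportVendorPrices, table type dbo.VendorPriceList, column layout SKU / ListPrice / Discount, placeholder connection string):

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using System.IO;
    using Microsoft.SqlServer.Server;

    public static class VendorImport
    {
        // Reads up to batchSize lines from the reader and yields them as TVP rows.
        static IEnumerable<SqlDataRecord> SendRows(StreamReader reader, int batchSize)
        {
            var record = new SqlDataRecord(
                new SqlMetaData("SKU",       SqlDbType.VarChar, 50),
                new SqlMetaData("ListPrice", SqlDbType.Decimal, 19, 4),
                new SqlMetaData("Discount",  SqlDbType.Decimal, 19, 4));

            for (int sent = 0; sent < batchSize && !reader.EndOfStream; sent++)
            {
                string[] fields = reader.ReadLine().Split(',');   // plus any validation needed
                record.SetString(0, fields[0]);
                record.SetDecimal(1, decimal.Parse(fields[1]));
                record.SetDecimal(2, decimal.Parse(fields[2]));
                yield return record;
            }
        }

        public static void ImportFile(string path, int vendorId, int batchSize)
        {
            using (var reader = new StreamReader(path))
            using (var conn = new SqlConnection("...connection string..."))
            {
                conn.Open();   // or open/close per batch; worth testing both
                while (!reader.EndOfStream)
                {
                    using (var cmd = new SqlCommand("dbo.ImportVendorPrices", conn))
                    {
                        cmd.CommandType = CommandType.StoredProcedure;
                        cmd.Parameters.AddWithValue("@VendorID", vendorId);
                        var tvp = cmd.Parameters.AddWithValue("@Prices", SendRows(reader, batchSize));
                        tvp.SqlDbType = SqlDbType.Structured;
                        tvp.TypeName  = "dbo.VendorPriceList";
                        cmd.ExecuteNonQuery();
                    }
                }
            }
        }
    }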

Using this type of structure, you will be sending in Product properties that go unused (i.e. only the SKU is used to look up existing Products). BUT, it scales very well, since there is no limit on file size. If a Vendor sends 50 Products, fine. If they send 50k Products, fine. If they send 4 million Products (which is the system I worked on, and it handled updating Product info that was different for any of its properties!), then fine. There is no increase in memory usage at the application layer or the database layer, even when processing 10 million Products; only the time the import takes grows with the number of Products being sent.


UPDATE 2
New information about the data source:

  • it comes from an Azure EventHub
  • it comes in the form of C# objects (no files)
  • Product details come in through the OP's system's API
  • it is collected in a separate queue (the data just needs to be pulled out and inserted into the database)

If the data source is C# objects, then I would most definitely use TVPs, since you can stream them via the kind of method described in my first update (i.e. a method that returns IEnumerable<SqlDataRecord>). Send one or more TVPs for the per-vendor price / offer details, but use regular input parameters for the singular property attributes. For example:

    CREATE PROCEDURE dbo.ImportProduct
    (
        @SKU             VARCHAR(50),
        @ProductName     NVARCHAR(100),
        @Manufacturer    NVARCHAR(100),
        @Category        NVARCHAR(300),
        @VendorPrices    dbo.VendorPrices READONLY,
        @DiscountCoupons dbo.DiscountCoupons READONLY
    )
    AS
    SET NOCOUNT ON;

    -- Insert Product if it doesn't already exist
    IF (NOT EXISTS(SELECT *
                   FROM   dbo.Products pr
                   WHERE  pr.SKU = @SKU
                  )
       )
    BEGIN
        INSERT INTO dbo.Products (SKU, ProductName, Manufacturer, Category, ...)
        VALUES (@SKU, @ProductName, @Manufacturer, @Category, ...);
    END;

    -- ...INSERT data from TVPs
    -- (might need OPTION (RECOMPILE) per each TVP query to ensure proper estimated rows)
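And a rough sketch of the corresponding call from C#. The column layout assumed for the dbo.VendorPrices table type (VendorID, ListPrice, Discount) and the tuple shape of the incoming data are assumptions; the streaming helper follows the same IEnumerable<SqlDataRecord> pattern shown earlier:

    using System.Collections.Generic;
    using System.Data;
    using System.Data.SqlClient;
    using Microsoft.SqlServer.Server;

    public static class ProductImporter
    {
        // Assumed column layout for the dbo.VendorPrices table type.
        static IEnumerable<SqlDataRecord> ToVendorPriceRecords(
            IEnumerable<(int VendorID, decimal ListPrice, decimal Discount)> prices)
        {
            var rec = new SqlDataRecord(
                new SqlMetaData("VendorID",  SqlDbType.Int),
                new SqlMetaData("ListPrice", SqlDbType.Decimal, 19, 4),
                new SqlMetaData("Discount",  SqlDbType.Decimal, 19, 4));
            foreach (var p in prices)
            {
                rec.SetInt32(0, p.VendorID);
                rec.SetDecimal(1, p.ListPrice);
                rec.SetDecimal(2, p.Discount);
                yield return rec;
            }
        }

        public static void ImportProduct(SqlConnection conn, string sku, string name,
            string manufacturer, string category,
            IReadOnlyCollection<(int VendorID, decimal ListPrice, decimal Discount)> prices)
        {
            using (var cmd = new SqlCommand("dbo.ImportProduct", conn))
            {
                cmd.CommandType = CommandType.StoredProcedure;
                cmd.Parameters.AddWithValue("@SKU", sku);
                cmd.Parameters.AddWithValue("@ProductName", name);
                cmd.Parameters.AddWithValue("@Manufacturer", manufacturer);
                cmd.Parameters.AddWithValue("@Category", category);

                if (prices.Count > 0)   // an empty enumeration cannot be sent as a TVP
                {
                    SqlParameter tvp = cmd.Parameters.AddWithValue("@VendorPrices", ToVendorPriceRecords(prices));
                    tvp.SqlDbType = SqlDbType.Structured;
                    tvp.TypeName  = "dbo.VendorPrices";
                }
                // @DiscountCoupons would be passed the same way

                cmd.ExecuteNonQuery();
            }
        }
    }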

From the database's point of view, nothing is as fast as BULK INSERT (for example, from csv files). It is best to load all of the data in bulk as quickly as possible and then process it with stored procedures.

The C# layer will only slow the process down, since queries going back and forth between C# and SQL Server are thousands of times slower than what SQL Server can handle internally.
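A minimal sketch of that idea, with hypothetical staging table, procedure, and file names (dbo.Staging_PriceMessages, dbo.ProcessStagedPriceMessages, C:\Import\prices.csv); note that the csv path must be visible to the SQL Server instance itself, not to the C# process:

    using System.Data.SqlClient;

    public static class StagingLoad
    {
        // One round trip to bulk-load the csv into a staging table, then the
        // stored procedure does the set-based processing entirely inside SQL Server.
        public static void LoadAndProcess(string connectionString)
        {
            const string bulkLoad = @"
                TRUNCATE TABLE dbo.Staging_PriceMessages;

                BULK INSERT dbo.Staging_PriceMessages
                FROM 'C:\Import\prices.csv'
                WITH (FIELDTERMINATOR = ',', ROWTERMINATOR = '\n', FIRSTROW = 2, TABLOCK);

                EXEC dbo.ProcessStagedPriceMessages;  -- resolve master data, insert prices
            ";

            using (var conn = new SqlConnection(connectionString))
            using (var cmd = new SqlCommand(bulkLoad, conn))
            {
                conn.Open();
                cmd.CommandTimeout = 0;   // bulk loads can exceed the 30-second default
                cmd.ExecuteNonQuery();
            }
        }
    }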

