What is the best way to load a huge set of results into memory?

I'm trying to load 2 huge result sets (source and target) coming from different RDBMSs, and the problem I'm struggling with is getting these 2 huge result sets into memory.

The following are queries for retrieving data from a source and target:

SQL Server: select Id as LinkedColumn, CompareColumn from Source order by LinkedColumn

Oracle: select Id as LinkedColumn, CompareColumn from Target order by LinkedColumn

Source records: 12,377,200

Target records: 12,266,800

Below are the approaches I tried with some statistics:

1) Open data reader approach for reading source and target data:

Total jobs running in parallel = 3; time taken by Job1 = 01:47:25, Job2 = 01:47:25, Job3 = 01:48:32. There is no index on the Id column.

Most of the time is spent here: var dr = command.ExecuteReader();

Problem: there are also timeouts, which force me to set CommandTimeout to 0 (infinite), and that is bad.

2) Chunked reading approach for reading source and target data:

Total jobs = 1; chunk size: 100,000; time taken: 02:02:48. There is no index on the Id column.

3) Chunked reading approach for reading source and target data:

Total jobs = 1; chunk size: 100,000; time taken: 00:39:40. Index is present on the Id column.

4) Open data reader approach for reading source and target data:

Total jobs = 1; index: yes; time: 00:01:43.

5) Open data reader approach for reading source and target data:

Total jobs running in parallel = 3; index: yes; time: 00:25:12.

I noticed that although an index on LinkedColumn does improve performance, the problem is that we are dealing with third-party RDBMS tables that may or may not have an index.

We would like to keep the database server as free as possible, so the data reader approach does not seem like a good idea: many jobs running in parallel would put more pressure on the database server than we want.

Therefore, we want to fetch the records into memory from both source and target and then do a 1-to-1 comparison of the records, keeping the database server free.

Note: I want to do this in my C# application and do not want to use SSIS or a linked server.

Update:

Source SQL query execution time in SQL Server Management Studio: 00:01:41. Target SQL query execution time in SQL Server Management Studio: 00:01:40.

What would be the best way to read a huge result set into memory?

Code:

static void Main(string[] args)
{
    // Running 3 jobs in parallel
    //Task<string>[] taskArray = {
    //    Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare()),
    //    Task<string>.Factory.StartNew(() => Compare())
    //};

    Compare(); // Run single job
    Console.ReadKey();
}

public static string Compare()
{
    Stopwatch stopwatch = new Stopwatch();
    stopwatch.Start();

    var srcConnection = new SqlConnection("Source Connection String");
    srcConnection.Open();
    var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection);

    var tgtConnection = new SqlConnection("Target Connection String");
    tgtConnection.Open();
    var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection);

    var drA = GetReader(command1);
    var drB = GetReader(command2);

    stopwatch.Stop();
    string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss");
    Console.WriteLine(a);
    return a;
}

private static IDataReader GetReader(SqlCommand command)
{
    command.CommandTimeout = 0;
    return command.ExecuteReader(); // Culprit
}
7 answers

There is nothing (that I know of) faster than a DataReader for fetching db records.

Working with large databases has its own problems, and reading 10 million records in under 2 minutes is pretty good.

If you want faster, you can:

  • Jdwend's suggestion:

Use sqlcmd.exe and the Process class to run the query and dump the results into a csv file, then read the csv back in C#. sqlcmd.exe is designed for archiving large databases and runs about 100x faster than the C# interface. Using linq methods is also faster than the SqlClient classes. (A hedged sketch of this is at the end of this answer.)

  • Parallelize your queries and merge the results as they come in: https://shahanayyub.wordpress.com/2014/03/30/how-to-load-large-dataset-in-datagridview/

  • The easiest (and IMO the best for a SELECT *) is to throw hardware at it: https://blog.codinghorror.com/hardware-is-cheap-programmers-are-expensive/

Also make sure you test on the PROD hardware, in release mode, as that can skew your benchmarks.
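To make the sqlcmd.exe idea concrete, here is a rough sketch (not the answerer's exact method): shell out to sqlcmd.exe through the Process class, dump the query result to a csv file, then stream the file back in C#. The server name, database name, output path and flags are placeholders you would adapt to your environment.

// Rough sketch of the sqlcmd.exe + csv idea. MyServer, MyDb and the output
// path are placeholders; adjust the sqlcmd flags for your environment.
using System.Diagnostics;
using System.IO;

class SqlCmdExportSketch
{
    static void Main()
    {
        var outFile = @"C:\temp\source_dump.csv"; // hypothetical path

        var args = "-S MyServer -d MyDb -E "      // hypothetical server/db, integrated security
                 + "-Q \"select Id as LinkedColumn, CompareColumn from Source order by LinkedColumn\" "
                 + "-s \",\" -W -h -1 "           // comma separator, trim spaces, no header row
                 + "-o \"" + outFile + "\"";

        var psi = new ProcessStartInfo("sqlcmd.exe", args)
        {
            UseShellExecute = false,
            CreateNoWindow = true
        };

        using (var p = Process.Start(psi))
        {
            p.WaitForExit(); // let sqlcmd finish writing the csv
        }

        // Stream the csv back instead of holding the whole result set at once.
        using (var reader = new StreamReader(outFile))
        {
            string line;
            while ((line = reader.ReadLine()) != null)
            {
                var parts = line.Split(','); // LinkedColumn, CompareColumn
                // ... feed parts into the comparison logic here
            }
        }
    }
}

Streaming the csv line by line keeps memory usage flat even for 12 million rows; the trade-off is the extra disk round trip.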


This is the pattern I use. It fetches the data for a particular set of records into an instance of System.Data.DataTable, and then closes and disposes all unmanaged resources ASAP. The pattern also works for other providers under System.Data, including System.Data.OleDb, System.Data.SqlClient, etc. I believe the Oracle client SDK implements the same pattern.

// don't forget these using statements
using System.Data;
using System.Data.SqlClient;

// here the code
var connectionstring = "YOUR_CONN_STRING";
var table = new DataTable("MyData");

using (var cn = new SqlConnection(connectionstring))
{
    cn.Open();
    using (var cmd = cn.CreateCommand())
    {
        cmd.CommandText = "Select [Fields] From [Table] etc etc"; // your SQL statement here
        using (var adapter = new SqlDataAdapter(cmd))
        {
            adapter.Fill(table);
        } // dispose adapter
    } // dispose cmd
    cn.Close();
} // dispose cn

foreach (DataRow row in table.Rows)
{
    // do something with the data set
}

I think I would deal with this problem differently.

But first, let me make some assumptions:

  • According to your question description, you will be getting data from SQL Server and Oracle.
  • Each query returns a huge amount of data.
  • You do not say why you need all this data in memory, or what you use it for.
  • I assume the data you process will be used more than once, and that you will not want to repeat both queries multiple times.
  • And whatever you do with the data will probably not be displayed to the user all at once.

Based on those assumptions, I would proceed as follows:

  • Think of this problem as data processing.
  • Have a third database, or some other place, with auxiliary tables in which you can store the entire result of the 2 queries.
  • To avoid the timeouts etc., try to fetch the data using paging (grab thousands of rows at a time) and save it into these aux DB tables, NOT into RAM (see the sketch after this list).
  • Once your logic completes the whole data load (import/migration), you can start processing it.
  • Data processing is what database engines are made for; they are efficient and have evolved a lot over the years, so don't waste time reinventing the wheel. Use a stored procedure to crunch/process/merge the 2 auxiliary tables into 1.
  • Now that you have all the "merged" data in a 3rd aux table, you can use it for display or whatever else you need.
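A minimal sketch of the paging-into-aux-tables step described above. This is a sketch only: the connection strings, the Aux_Source staging table and the 100,000 page size are assumptions, and the same loop would be repeated against the Oracle target with its own provider.

// Hedged sketch: page through the source with OFFSET/FETCH and bulk-copy
// each page into a staging table, so nothing large ever sits in RAM.
// Connection strings, Aux_Source and PageSize are assumptions.
using System.Data.SqlClient;

class AuxTableLoaderSketch
{
    const int PageSize = 100000;

    static void Main()
    {
        using (var src = new SqlConnection("Source Connection String"))
        using (var aux = new SqlConnection("Aux DB Connection String"))
        {
            src.Open();
            aux.Open();

            for (int offset = 0; ; offset += PageSize)
            {
                using (var cmd = new SqlCommand(
                    "select Id as LinkedColumn, CompareColumn from Source " +
                    "order by LinkedColumn " +
                    "offset @offset rows fetch next @pageSize rows only", src))
                {
                    cmd.Parameters.AddWithValue("@offset", offset);
                    cmd.Parameters.AddWithValue("@pageSize", PageSize);

                    using (var reader = cmd.ExecuteReader())
                    {
                        if (!reader.HasRows)
                            return; // no more pages, loading is done

                        using (var bulk = new SqlBulkCopy(aux))
                        {
                            bulk.DestinationTableName = "Aux_Source"; // hypothetical staging table
                            bulk.WriteToServer(reader); // stream the page straight into the aux DB
                        }
                    }
                }
            }
        }
    }
}

Once both staging tables are filled, the stored procedure mentioned above does the merge inside the database.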

If you want to read it faster, use the native API to fetch the data. Avoid frameworks like LINQ and rely on the DataReader. Also check whether you need something like a dirty read (WITH (NOLOCK) in SQL Server).

If your data is very large, try a partial read; something like indexing into your data. Perhaps you can add a from-to condition (for example on a date column) and keep selecting ranges until everything has been read.

After that, you should consider using threading in your system to fetch in parallel: one thread to read from source 1, another thread to read from source 2. This will cut down a lot of the time.
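As a rough illustration of that threading idea, here is a sketch with one task per source. The connection strings, queries and column types are assumptions, and, as in the question's own code, both sides use SqlConnection here; WITH (NOLOCK) is the SQL Server dirty-read hint mentioned above.

// Sketch: one task per source, each streaming its own DataReader.
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Threading.Tasks;

class ParallelFetchSketch
{
    static List<KeyValuePair<long, string>> Fetch(string connectionString, string sql)
    {
        var rows = new List<KeyValuePair<long, string>>();
        using (var cn = new SqlConnection(connectionString))
        using (var cmd = new SqlCommand(sql, cn))
        {
            cn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    rows.Add(new KeyValuePair<long, string>(
                        reader.GetInt64(0), reader.GetString(1)));
            }
        }
        return rows;
    }

    static void Main()
    {
        // One task per source; WITH (NOLOCK) only applies to the SQL Server side.
        var sourceTask = Task.Run(() => Fetch("Source Connection String",
            "select Id as LinkedColumn, CompareColumn from Source with (nolock)"));
        var targetTask = Task.Run(() => Fetch("Target Connection String",
            "select Id as LinkedColumn, CompareColumn from Target"));

        Task.WaitAll(sourceTask, targetTask);

        var source = sourceTask.Result;
        var target = targetTask.Result;
        // ... run the 1-to-1 comparison here
    }
}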


Technically, I think there is a more fundamental problem here.

select [...] order by LinkedColumn

I notice that although the LinkedColumn index improves performance, the problem is that we are dealing with third-party RDBMS tables that may or may not have an index.

We would like the database server to be as free as possible

If you cannot guarantee that the database has a tree-based index on that column, it means the database will be very busy sorting your millions of rows. That is slow and resource-hungry. Get rid of the order by in the SQL statement and do the sorting on the application side to get results faster and reduce the load on the DB... or make sure the DB has such an index!

...depending on whether this selection is a frequent or a rare operation, you will either want to add the proper index to the database, or just fetch everything and sort it client-side.
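A hedged sketch of the "fetch without ORDER BY and sort client-side" option; the connection string and column types are assumptions:

// Read the rows unsorted (no ORDER BY, so the DB does no sorting work),
// then sort in application memory before the 1-to-1 comparison.
using System.Collections.Generic;
using System.Data.SqlClient;

class ClientSideSortSketch
{
    static void Main()
    {
        var rows = new List<KeyValuePair<long, string>>();

        using (var cn = new SqlConnection("Source Connection String"))
        using (var cmd = new SqlCommand(
            "select Id as LinkedColumn, CompareColumn from Source", cn)) // no ORDER BY
        {
            cn.Open();
            using (var reader = cmd.ExecuteReader())
            {
                while (reader.Read())
                    rows.Add(new KeyValuePair<long, string>(
                        reader.GetInt64(0), reader.GetString(1)));
            }
        }

        // Sort on the application server instead of the database server.
        rows.Sort((a, b) => a.Key.CompareTo(b.Key));
    }
}

This moves the sort cost from the shared database server onto the application box, which is exactly the trade-off this answer describes.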


I had a similar situation many years ago. Before I looked at the problem, it took 5 days to move the data between two systems using SQL.

I used a different approach.

We extracted the data from the source system into a small number of files representing a flattened data model, and arranged the data in each file so that it all naturally flowed in the proper sequence as the files were read.

Then I wrote a Java program that processed these flattened data files and produced individual table load files for the target system. So, for example, the source extract consisted of fewer than a dozen data files from the source system, which turned into 30-40 or so load files for the target database.

This process would run in just a couple of minutes, and I built in full auditing and error reporting, so we could quickly identify problems and discrepancies in the source data, get them fixed, and run the processor again.

The last piece of the puzzle was a multi-threaded utility I wrote that loaded each load file into the target Oracle database in parallel. This utility spawned a Java process for each table and used Oracle's bulk loader program to quickly push the data into the Oracle DB.

When all was said and done, that 5-day SQL-to-SQL transfer of millions of records took only 30 minutes, using a combination of Java and Oracle's bulk-load capabilities. And there were no errors, and we accounted for every penny of every account that was transferred between the systems.

So, perhaps think outside the SQL box and use Java, the file system, and the Oracle bulk loader. And make sure you do your file IO on solid-state drives.


If you need to process large database result sets from Java, you can opt for JDBC to give you the low-level control required. On the other hand, if you are already using an ORM in your application, falling back to JDBC can mean extra pain. You would lose features such as optimistic locking, caching, automatic fetching when navigating the domain model, and so on. Fortunately, most ORMs, such as Hibernate, have options that help with this. These techniques are not new, but there are a couple of possibilities to choose from.

A simplified example: suppose we have a table (mapped to the DemoEntity class) with 100,000 records. Each record consists of a single column (mapped to the property field in DemoEntity) containing some random alphanumeric data of roughly ~2 KB. The JVM is run with -Xmx250m; let's assume 250 MB is the overall maximum memory that can be allocated to the JVM on our system. Your job is to read all the records currently in the table, do some processing, and finally store the result. We will assume that the entities resulting from this bulk operation are not modified.


Source: https://habr.com/ru/post/1275361/

