I'm trying to load 2 huge result sets (source and target) from different DBMSs, but the problem I'm struggling with is getting these 2 huge results in memory.
The following are queries for retrieving data from a source and target:
Sql Server - select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn
Oracle - select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn
Records in a source: 12377200
Target Records: 12266800
Below are the approaches I tried with some statistics:
1) an open data reading approach for reading source and target data :
Total jobs running in parallel = 3 Time taken by Job1 = 01:47:25 Time taken by Job1 = 01:47:25 Time taken by Job1 = 01:48:32 There is no index on Id Column.
Most of the time is spent here: var dr = command.ExecuteReader();
Problems: There are also timeout problems for which I have to keep commandtimeout 0(infinity) and this is bad.
2) The approach to reading in parts to read the source and target data:
Total jobs = 1 Chunk size : 100000 Time Taken : 02:02:48 There is no index on Id Column.
3) The approach to reading in parts to read the source and target data:
Total jobs = 1 Chunk size : 100000 Time Taken : 00:39:40 Index is present on Id column.
4) an open data reading approach for reading source and target data:
Total jobs = 1 Index : Yes Time: 00:01:43
5) an open data reading approach for reading source and target data:
Total jobs running in parallel = 3 Index : Yes Time: 00:25:12
I noticed that although the LinkedColumn index does improve performance, the problem is that we are dealing with a third-party RDBMS table that may not have an index.
We would like to keep the database server as free as possible so that the data reading approach does not seem like a good idea, because in parallel there will be many tasks that will put so much pressure on the database server that we do not need.
Therefore, we want to extract records from the resource’s memory from the source to the target and compare 1 to 1 records to free the database server.
Note: I want to do this in my C # application and do not want to use SSIS or Linked Server.
Update:
Source Sql Query Execution time in sql server management studio: 00:01:41 Target Sql Query Execution time in sql server management studio:00:01:40
What would be the best way to read a huge set of results in mind?
Code:
static void Main(string[] args) { // Running 3 jobs in parallel //Task<string>[] taskArray = { Task<string>.Factory.StartNew(() => Compare()), //Task<string>.Factory.StartNew(() => Compare()), //Task<string>.Factory.StartNew(() => Compare()) //}; Compare();//Run single job Console.ReadKey(); } public static string Compare() { Stopwatch stopwatch = new Stopwatch(); stopwatch.Start(); var srcConnection = new SqlConnection("Source Connection String"); srcConnection.Open(); var command1 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Source order by LinkedColumn", srcConnection); var tgtConnection = new SqlConnection("Target Connection String"); tgtConnection.Open(); var command2 = new SqlCommand("select Id as LinkedColumn,CompareColumn from Target order by LinkedColumn", tgtConnection); var drA = GetReader(command1); var drB = GetReader(command2); stopwatch.Stop(); string a = stopwatch.Elapsed.ToString(@"d\.hh\:mm\:ss"); Console.WriteLine(a); return a; } private static IDataReader GetReader(SqlCommand command) { command.CommandTimeout = 0; return command.ExecuteReader();//Culprit }