Extreme performance difference when using DataTable.Add

Take a look at the program below. It's fairly self-explanatory, but I'll walk through it anyway :)

I have two methods, one fast and one slow. Both do the same thing: they create a DataTable with 50,000 rows and 1,000 columns, and write values into a configurable number of those columns. In the code below I picked 10 ( NUM_COLS_TO_WRITE_TO ).

In other words, only 10 of the 1,000 columns will actually contain data. The only difference between the two methods is that the fast one populates the columns and then calls DataTable.Rows.Add , while the slow one calls it first and populates the columns afterwards. That's it.

The difference in performance, however, is shocking (to me, at least). The fast version is almost completely unaffected by the number of columns we write to, while the slow one scales linearly with it. For example, when the number of columns I write to is 20, the fast version takes 2.8 seconds, but the slow version takes over a minute.

What on earth could be going on here?

I thought that maybe calling dt.BeginLoadData would make a difference, and it does to some extent: it reduces the time from 61 seconds to ~50 seconds, but that is still a huge difference.
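For reference, a sketch of how BeginLoadData / EndLoadData can wrap the slow loop. The exact placement is my assumption, and the constants are scaled down from the original 50,000 × 1,000 so the snippet runs quickly:

```csharp
using System;
using System.Data;

public class LoadDataDemo
{
    public static void Main()
    {
        const int NUM_ROWS = 1000;           // scaled down from 50,000
        const int NUM_COLS_TO_CREATE = 100;  // scaled down from 1,000
        const int NUM_COLS_TO_WRITE_TO = 10;

        var dt = new DataTable();
        for (int i = 0; i < NUM_COLS_TO_CREATE; i++)
            dt.Columns.Add("x" + i, typeof(string));

        dt.BeginLoadData();  // suspends constraint checking and index maintenance
        for (int i = 0; i < NUM_ROWS; i++)
        {
            var theRow = dt.NewRow();
            dt.Rows.Add(theRow);  // row added *before* being populated (the slow pattern)
            for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
                theRow[j] = "whatever";
        }
        dt.EndLoadData();

        Console.WriteLine(dt.Rows.Count);
    }
}
```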

Of course, the obvious answer is: "Well, don't do it that way." Sure. But what on earth is causing this? Is it expected behavior? I certainly didn't expect it. :)

 using System;
 using System.Data;
 using System.Diagnostics;

 public class Program
 {
     private const int NUM_ROWS = 50000;
     private const int NUM_COLS_TO_WRITE_TO = 10;
     private const int NUM_COLS_TO_CREATE = 1000;

     private static void AddRowFast()
     {
         DataTable dt = new DataTable();

         //add a table with 1000 columns
         for (int i = 0; i < NUM_COLS_TO_CREATE; i++)
         {
             dt.Columns.Add("x" + i, typeof(string));
         }

         for (int i = 0; i < NUM_ROWS; i++)
         {
             var theRow = dt.NewRow();
             for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
             {
                 theRow[j] = "whatever";
             }

             //add the row *after* populating it
             dt.Rows.Add(theRow);
         }
     }

     private static void AddRowSlow()
     {
         DataTable dt = new DataTable();

         //add a table with 1000 columns
         for (int i = 0; i < NUM_COLS_TO_CREATE; i++)
         {
             dt.Columns.Add("x" + i, typeof(string));
         }

         for (int i = 0; i < NUM_ROWS; i++)
         {
             var theRow = dt.NewRow();

             //add the row *before* populating it
             dt.Rows.Add(theRow);
             for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
             {
                 theRow[j] = "whatever";
             }
         }
     }

     static void Main(string[] args)
     {
         var sw = Stopwatch.StartNew();
         AddRowFast();
         sw.Stop();
         Console.WriteLine(sw.Elapsed.TotalMilliseconds);

         sw.Restart();
         AddRowSlow();
         sw.Stop();
         Console.WriteLine(sw.Elapsed.TotalMilliseconds);

         //When NUM_COLS_TO_WRITE_TO is 5
         //FAST: 2754.6782 ms
         //SLOW: 15794.1378 ms

         //When NUM_COLS_TO_WRITE_TO is 10
         //FAST: 2777.431 ms
         //SLOW: 32004.7203 ms

         //When NUM_COLS_TO_WRITE_TO is 20
         //FAST: 2831.1733 ms
         //SLOW: 61246.2243 ms
     }
 }

Update

Calling theRow.BeginEdit and theRow.EndEdit in the slow version makes its runtime more or less constant (~4 seconds on my machine). If I actually had constraints on the table, I think this would make sense to me.
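A sketch of that workaround, batching all the writes to an attached row inside one explicit edit (constants are scaled down from the original so it runs quickly):

```csharp
using System;
using System.Data;

public class BatchedEditDemo
{
    private const int NUM_ROWS = 1000;           // scaled down from 50,000
    private const int NUM_COLS_TO_CREATE = 100;  // scaled down from 1,000
    private const int NUM_COLS_TO_WRITE_TO = 10;

    public static void Main()
    {
        var dt = new DataTable();
        for (int i = 0; i < NUM_COLS_TO_CREATE; i++)
            dt.Columns.Add("x" + i, typeof(string));

        for (int i = 0; i < NUM_ROWS; i++)
        {
            var theRow = dt.NewRow();
            dt.Rows.Add(theRow);  // attached first, as in the slow version

            theRow.BeginEdit();   // snapshot the row once for the whole batch
            for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
                theRow[j] = "whatever";
            theRow.EndEdit();     // constraints checked / events fired once
        }

        Console.WriteLine(dt.Rows.Count);
    }
}
```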

1 answer

When a row is attached to a table, much more work is done to record and track state on every change.

For example, if you do this,

 theRow.BeginEdit();
 for (int j = 0; j < NUM_COLS_TO_WRITE_TO; j++)
 {
     theRow[j] = "whatever";
 }
 theRow.CancelEdit();

Then in BeginEdit() , the row internally takes a copy of its contents so that it can be rolled back at any point — and the end result of the snippet above is an empty row again, with no whatever values. This happens even in BeginLoadData mode. Following the BeginEdit path, if the row is attached to a DataTable, it calls DataTable.NewRecord() , which shows that it simply copies every value of every column to preserve the original state in case a cancel is required; there is no magic here. On the other hand, if the row is not attached to a DataTable, none of this happens in BeginEdit and it completes quickly.
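A minimal sketch of that rollback behavior on an attached row — after CancelEdit the written value is gone and the cell is DBNull again:

```csharp
using System;
using System.Data;

public class CancelEditDemo
{
    public static void Main()
    {
        var dt = new DataTable();
        dt.Columns.Add("x0", typeof(string));

        var theRow = dt.NewRow();
        dt.Rows.Add(theRow);   // attach the row first, as in the slow version

        theRow.BeginEdit();    // snapshots the row's current (empty) contents
        theRow[0] = "whatever";
        theRow.CancelEdit();   // rolls the row back to the snapshot

        // The write was undone: the cell holds DBNull again.
        Console.WriteLine(theRow.IsNull(0)); // True
    }
}
```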

EndEdit() is also quite heavy (when attached), since all constraints are checked there (maximum length, whether columns allow nulls, etc.). It also fires a bunch of events, explicitly frees the storage used if the edit was cancelled, and makes the change available via DataTable.GetChanges() — which is still possible inside BeginLoadData . In fact, looking at the source, all BeginLoadData seems to do is turn off constraint checking and indexing.

So that covers what BeginEdit and EndEdit do, and how different they are depending on whether the row is attached, in terms of what gets stored. Now consider that single theRow[j] = "whatever" : as you can see in the indexer setter for DataRow , it calls BeginEditInternal and then EndEdit on every single call (unless the row is already in edit mode because you explicitly called BeginEdit earlier). That means it copies and stores every value of every column in the row on every assignment. Do that 10 times per row, and with your 1,000-column DataTable and 50,000 rows, that is 500,000,000 value copies. On top of that, all the versioning, checks, and events fire after each change — so, overall, it is simply far slower when a row is attached to a DataTable than when it is not.
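One way to observe those implicit per-assignment edit cycles (a sketch, not from the original post) is to count RowChanged events on an attached row: without an explicit BeginEdit , each assignment completes its own edit and raises the event; inside an explicit BeginEdit / EndEdit , the whole batch is one edit:

```csharp
using System;
using System.Data;

public class ImplicitEditDemo
{
    private static int CountRowChanged(bool useExplicitEdit)
    {
        var dt = new DataTable();
        for (int i = 0; i < 20; i++)
            dt.Columns.Add("x" + i, typeof(string));

        int changes = 0;
        dt.RowChanged += (s, e) => changes++;

        var theRow = dt.NewRow();
        dt.Rows.Add(theRow);   // attach first, as in the slow version
        changes = 0;           // ignore the event raised by Rows.Add itself

        if (useExplicitEdit) theRow.BeginEdit();
        for (int j = 0; j < 10; j++)
            theRow[j] = "whatever";       // implicit edit cycle per call unless editing
        if (useExplicitEdit) theRow.EndEdit();

        return changes;
    }

    public static void Main()
    {
        Console.WriteLine(CountRowChanged(false)); // one event per assignment
        Console.WriteLine(CountRowChanged(true));  // a single event for the batch
    }
}
```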


Source: https://habr.com/ru/post/981787/

