Why does concatenation of DataFrames get exponentially slower?

I have a function that processes a DataFrame, largely to bucket the data and create a binary matrix of features in a particular column using pd.get_dummies(df[col]).

To avoid processing all of my data with this function at once (which runs out of memory and makes IPython crash), I broke the large DataFrame into chunks using:

    chunks = (len(df) / 10000) + 1
    df_list = np.array_split(df, chunks)

pd.get_dummies(df[col]) automatically creates new columns based on the contents of df[col], and these may differ for each df in df_list.
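For illustration, a minimal sketch (with made-up category values, not my real data) of how the dummy columns can differ between chunks, and how pd.concat aligns them:

    import pandas as pd

    # Two chunks whose 'col' categories only partially overlap.
    chunk_a = pd.DataFrame({"col": ["red", "blue"]})
    chunk_b = pd.DataFrame({"col": ["blue", "green"]})

    dummies_a = pd.get_dummies(chunk_a["col"])  # columns: blue, red
    dummies_b = pd.get_dummies(chunk_b["col"])  # columns: blue, green

    # pd.concat aligns on the union of columns and fills the gaps with NaN.
    combined = pd.concat([dummies_a, dummies_b], axis=0)
    print(sorted(combined.columns.tolist()))    # ['blue', 'green', 'red']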

After processing, I concatenate the DataFrames back together using:

    for i, df_chunk in enumerate(df_list):
        print "chunk", i
        [x, y] = preprocess_data(df_chunk)
        super_x = pd.concat([super_x, x], axis=0)
        super_y = pd.concat([super_y, y], axis=0)
        print datetime.datetime.utcnow()

The processing time for the first chunk is perfectly acceptable, but it grows with each chunk! This has nothing to do with preprocess_data(df_chunk), since there is no reason for that to take longer. Is the increase in time a result of calling pd.concat()?

See the log below:

    chunks 6
    chunk 0 2016-04-08 00:22:17.728849
    chunk 1 2016-04-08 00:22:42.387693
    chunk 2 2016-04-08 00:23:43.124381
    chunk 3 2016-04-08 00:25:30.249369
    chunk 4 2016-04-08 00:28:11.922305
    chunk 5 2016-04-08 00:32:00.357365

Is there a way to work around this? I have 2900 chunks to process, so any help is appreciated!

Open to any other suggestions in Python!

+24
performance python pandas concatenation processing-efficiency
Apr 08 '16 at 0:34
2 answers

Never call DataFrame.append or pd.concat inside a for loop. This results in quadratic copying.

pd.concat returns a new DataFrame. Space must be allocated for the new DataFrame, and data from the old DataFrames must be copied into it. Consider the amount of copying required by this line inside the for loop (assuming each x has size 1):

    super_x = pd.concat([super_x, x], axis=0)

    | iteration | size of old super_x | size of x | copying required |
    |         0 |                   0 |         1 |                1 |
    |         1 |                   1 |         1 |                2 |
    |         2 |                   2 |         1 |                3 |
    |       ... |                     |           |              ... |
    |       N-1 |                 N-1 |         1 |                N |

1 + 2 + 3 + ... + N = N(N+1)/2. So there are O(N**2) copies needed to complete the loop.
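For scale, with the 2900 chunks mentioned in the question that works out to 2900 * 2901 / 2 = 4,206,450 chunk-sized copies, versus 2900 copies if the concatenation is done once at the end.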

Now consider

    super_x = []
    for i, df_chunk in enumerate(df_list):
        [x, y] = preprocess_data(df_chunk)
        super_x.append(x)
    super_x = pd.concat(super_x, axis=0)

Appending to a list is an O(1) operation and requires no copying. Now there is a single call to pd.concat after the loop is done. This call requires N copies, since super_x contains N DataFrames of size 1. So when constructed this way, super_x requires O(N) copies.
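To see the difference in practice, here is a rough, self-contained timing sketch (not part of the original answer; the chunk count and sizes are arbitrary) comparing the two strategies:

    import time
    import numpy as np
    import pandas as pd

    # Arbitrary test data: 500 chunks of 1000 rows each.
    chunks = [pd.DataFrame(np.random.randn(1000, 10)) for _ in range(500)]

    # Quadratic: every iteration copies everything accumulated so far.
    t0 = time.time()
    acc = chunks[0]
    for chunk in chunks[1:]:
        acc = pd.concat([acc, chunk], axis=0)
    print("concat inside the loop: %.2fs" % (time.time() - t0))

    # Linear: copy each chunk exactly once, in a single concat at the end.
    t0 = time.time()
    result = pd.concat(chunks, axis=0)
    print("single concat:          %.2fs" % (time.time() - t0))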

+37
Apr 08 '16 at 0:53

Every time you concatenate, you return a copy of the data.

You want to keep a list of your chunks, then concatenate everything in a single final step.

    df_x = []
    df_y = []
    for i, df_chunk in enumerate(df_list):
        print "chunk", i
        [x, y] = preprocess_data(df_chunk)
        df_x.append(x)
        df_y.append(y)

    super_x = pd.concat(df_x, axis=0)
    del df_x  # Free up memory.
    super_y = pd.concat(df_y, axis=0)
    del df_y  # Free up memory.
+7
Apr 08 '16 at 0:53


