I have a function that processes a DataFrame, largely to bucket the data and create a binary matrix of features from a particular column using pd.get_dummies(df[col]).
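A simplified sketch of that kind of function (the column names col and label below are placeholders, not the real ones):

    import pandas as pd

    def preprocess_data(df, col='category', label='target'):
        # One-hot encode a single column into a binary feature matrix
        # and return it alongside the label column (placeholder names).
        x = pd.get_dummies(df[col])   # one column per distinct value in df[col]
        y = df[label]
        return x, y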
To avoid processing all of my data with this function at once (which runs out of memory and causes iPython to crash), I broke the large DataFrame into chunks using:
    chunks = (len(df) / 10000) + 1
    df_list = np.array_split(df, chunks)
pd.get_dummies(df) will automatically create new columns based on the contents of df[col] , and they may differ for each df in df_list .
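Since the dummy columns can differ per chunk, I assume they would need to be aligned at some point. One possible way (just a sketch; all_categories and encode_chunk are illustrative names) is to fix the category set up front so every chunk produces the same columns:

    # Collect the full set of categories from the original DataFrame,
    # then encode each chunk against that fixed set.
    all_categories = df[col].unique()

    def encode_chunk(chunk):
        cats = pd.Categorical(chunk[col], categories=all_categories)
        return pd.get_dummies(cats)   # same dummy columns for every chunk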
After processing, I concatenate the DataFrames back together using:
    for i, df_chunk in enumerate(df_list):
        print "chunk", i
        [x, y] = preprocess_data(df_chunk)
        super_x = pd.concat([super_x, x], axis=0)
        super_y = pd.concat([super_y, y], axis=0)
        print datetime.datetime.utcnow()
The processing time of the first chunk is perfectly acceptable, but it grows with every chunk! This has nothing to do with preprocess_data(df_chunk), since there is no reason for it to take longer. Is the increase in time a result of the call to pd.concat()?
See the log below:
    chunks 6
    chunk 0
    2016-04-08 00:22:17.728849
    chunk 1
    2016-04-08 00:22:42.387693
    chunk 2
    2016-04-08 00:23:43.124381
    chunk 3
    2016-04-08 00:25:30.249369
    chunk 4
    2016-04-08 00:28:11.922305
    chunk 5
    2016-04-08 00:32:00.357365
Is there a workaround to speed this up? I have 2900 chunks to process, so any help is appreciated!
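If pd.concat copies everything accumulated so far on each call, the total work would grow roughly quadratically with the number of chunks, which seems to match the log above. One possible workaround (just a sketch, untested at this scale) would be to collect the processed chunks in lists and concatenate once at the end, so previous results are never recopied inside the loop:

    x_parts, y_parts = [], []
    for i, df_chunk in enumerate(df_list):
        x, y = preprocess_data(df_chunk)
        x_parts.append(x)   # keep the pieces; no copying of prior results
        y_parts.append(y)

    # a single concat at the end copies each row only once
    super_x = pd.concat(x_parts, axis=0)
    super_y = pd.concat(y_parts, axis=0)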
Open to any other suggestions in Python!
performance python pandas concatenation processing-efficiency
jfive Apr 08 '16 at 0:34