I am trying to add two pandas Series together. The first series is very large and has a MultiIndex. The index of the second series is a small subset of the index of the first.
    import time

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
    df1 = pd.DataFrame(df1, columns=['total'])
    df2 = pd.concat([df1.iloc[50:55], df1.iloc[2000:2005]])  # df2 is a tiny subset of df1
Using the regular Series.add function takes about 9 seconds the first time, and about 2 seconds on subsequent runs (perhaps because pandas optimizes how the frame is stored in memory?).
    starttime = time.time()
    df1.total.add(df2.total, fill_value=0).sum()
    print "Method 1 took %f seconds" % (time.time() - starttime)
Manually iterating over the rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent runs.
    starttime = time.time()
    result = df1.total.copy()
    for row_index, row in df2.iterrows():
        result[row_index] += row['total']
    print "Method 2 took %f seconds" % (time.time() - starttime)
The difference in speed is especially noticeable when, as here, the index is a MultiIndex.
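To try to isolate the MultiIndex cost, here is a rough sketch (untested, and assuming the 5000-column shape from the example above) that encodes the two index levels into a single integer key before adding:

    # encode (row, col) as row * ncols + col so the index has one flat level;
    # 5000 is the column count used when df1 was built above
    idx1 = df1.index.get_level_values(0).values * 5000 + df1.index.get_level_values(1).values
    idx2 = df2.index.get_level_values(0).values * 5000 + df2.index.get_level_values(1).values
    flat1 = pd.Series(df1.total.values, index=idx1)
    flat2 = pd.Series(df2.total.values, index=idx2)
    flat1.add(flat2, fill_value=0).sum()  # same result, but alignment is over a flat index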
Why does Series.add perform so poorly here? Any suggestions for speeding this up? Is there a better alternative to iterating over each element of the series?
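One workaround I can think of (a sketch only, valid because df2's labels are all present in df1, as in this example) is to skip the full alignment and update just the overlapping labels:

    result = df1.total.copy()
    # update only the labels that occur in df2; .values avoids a second alignment
    result.loc[df2.index] += df2.total.values
    result.sum()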
Also, how can I sort or structure the data frame to improve the performance of either method? Any of these methods is much faster the second time it runs; how do I get that performance on the first run? Sorting with sort_index helps only a little.
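For reference, this is the sorting I tried (assuming a lexsorted MultiIndex is what pandas needs for fast label lookups):

    df1 = df1.sort_index()
    df2 = df2.sort_index()
    print df1.index.is_lexsorted()  # True once the MultiIndex is lexsorted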