A faster alternative to Series.add in pandas

I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second Series is a small subset of the index of the first.

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
    df1 = pd.DataFrame(df1, columns=['total'])
    df2 = pd.concat([df1.iloc[50:55], df1.iloc[2000:2005]])  # df2 is a tiny subset of df1

Using the regular Series.add function takes about 9 seconds the first time and 2 seconds on subsequent attempts (perhaps because pandas optimizes how the df is stored in memory?).

    import time

    starttime = time.time()
    df1.total.add(df2.total, fill_value=0).sum()
    print("Method 1 took %f seconds" % (time.time() - starttime))

Manually iterating over the rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent attempts.

    starttime = time.time()
    result = df1.total.copy()
    for row_index, row in df2.iterrows():
        result[row_index] += row
    print("Method 2 took %f seconds" % (time.time() - starttime))

The difference in speed is especially noticeable when (as here) the index is a MultiIndex.

Why does Series.add perform so poorly here? Any suggestions for speeding this up? Is there a better alternative to iterating over each element of the Series?

Also, how should I sort or structure the DataFrame to improve the performance of either method? The second run of either method is much faster. How do I get that performance on the first run? Sorting with sort_index helps only marginally.
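
By sorting, I mean something like this minimal sketch:

    # Lexicographically sort the MultiIndex so label lookups can use
    # binary search instead of a full scan (a sketch of what I tried).
    df1 = df1.sort_index()
    df2 = df2.sort_index()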

3 answers

You do not need a loop:

    df1.total[df2.index] += df2.total
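
A self-contained sketch of the one-liner (mine, not part of the original answer). Note that on recent pandas versions with copy-on-write enabled, the chained form above may not write back into df1, so the equivalent .loc spelling is safer:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
    df1 = pd.DataFrame(df1, columns=['total'])
    df2 = pd.concat([df1.iloc[50:55], df1.iloc[2000:2005]])

    # .loc spelling of the same in-place update; the right-hand side
    # aligns on df2's index, so only those ten rows are touched.
    df1.loc[df2.index, 'total'] += df2.total
    print(df1.total.sum())  # ten rows were each incremented by 1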

As HYRY notes, looking only at the small subset of df2's index is the more efficient approach in this situation. You can do the same with the slightly more robust add function (which can fill NaN):

    df1.total[df2.index] = df1.total[df2.index].add(df2.total, fill_value=0)

Although the syntax here is not very DRY...
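
As an aside, here is a small sketch (my example, not from the original answer) of what that extra robustness buys you: plain + propagates NaN, while add with fill_value=0 treats missing values as zero:

    import numpy as np
    import pandas as pd

    s1 = pd.Series([1.0, np.nan], index=['a', 'b'])
    s2 = pd.Series([10.0, 20.0], index=['a', 'b'])

    print(s1 + s2)                   # a: 11.0, b: NaN  (NaN propagates)
    print(s1.add(s2, fill_value=0))  # a: 11.0, b: 20.0 (NaN treated as 0)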

Comparing some timeit results, we can see that add is not significantly slower, and both are an enormous improvement over the naive for loop:

    In [11]: %%timeit
       ....: result = df1.total.copy()
       ....: for row_index, row in df2.iterrows():
       ....:     result[row_index] += row
       ....:
    100 loops, best of 3: 17.9 ms per loop

    In [12]: %timeit df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
    1000 loops, best of 3: 325 µs per loop

    In [13]: %timeit df1.total[df2.index] += df2.total
    1000 loops, best of 3: 283 µs per loop

It's an interesting question (and I may fill this in later) at what relative sizes this remains faster, but certainly in this extreme case it is a huge win...

The takeaway from this:

If you're writing a for loop (in Python) to speed things up, you're doing it wrong! :)


I think your second method may be faster in this particular case because you iterate through the smaller dataset (a small amount of work) and then access only a handful of elements of the larger dataset (an efficient operation, thanks to the pandas developers).

With the .add method, however, pandas has to look at both indexes in their entirety.

If df1 and df2 are the same length, the first method takes 54 ms, but the second takes > 2 minutes (on my machine; obviously, YMMV).
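
To see this effect yourself, here is a minimal sketch (mine, not from the original answer) of the equal-length setup; exact timings will vary, and the loop variant is slow enough at this size that you may want to shrink the data first:

    import time

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
    df1 = pd.DataFrame(df1, columns=['total'])
    df2 = df1.copy()  # same length as df1: .add has no small subset to exploit

    starttime = time.time()
    df1.total.add(df2.total, fill_value=0).sum()
    print("Vectorized add took %f seconds" % (time.time() - starttime))

    starttime = time.time()
    result = df1.total.copy()
    for idx, val in df2.total.items():  # label-by-label loop; very slow at 5,000,000 rows
        result.loc[idx] += val
    print("Manual loop took %f seconds" % (time.time() - starttime))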


Source: https://habr.com/ru/post/957632/

