A faster alternative to Series.add in pandas

I am trying to add two pandas Series together. The first Series is very large and has a MultiIndex. The index of the second Series is a small subset of the index of the first.

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
    df1 = pd.DataFrame(df1, columns=['total'])
    df2 = pd.concat([df1.iloc[50:55], df1.iloc[2000:2005]])  # df2 is a tiny subset of df1

Using the regular Series.add function takes about 9 seconds the first time and 2 seconds on subsequent attempts (perhaps because pandas optimizes how the df is stored in memory?).

    import time

    starttime = time.time()
    df1.total.add(df2.total, fill_value=0).sum()
    print("Method 1 took %f seconds" % (time.time() - starttime))

Manually iterating over the rows takes about 2/3 as long as Series.add the first time, and about 1/100 as long as Series.add on subsequent attempts.

    starttime = time.time()
    result = df1.total.copy()
    for row_index, row in df2.iterrows():
        result[row_index] += row
    print("Method 2 took %f seconds" % (time.time() - starttime))

The difference in speed is especially noticeable when (as here) the index is a MultiIndex.

Why does Series.add perform so poorly here? Any suggestions for speeding this up? Is there a better alternative to iterating over each element of the Series?

Also, how should I sort or structure the DataFrame to improve the performance of either method? The second run of either method is much faster. How do I get that performance on the first run? Sorting with sort_index helps only marginally.
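
By sorting, I mean something like this minimal sketch:

    # Lexicographically sort the MultiIndex so label lookups can use
    # binary search instead of a full scan (a sketch of what I tried).
    df1 = df1.sort_index()
    df2 = df2.sort_index()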

3 answers

You do not need a loop:

    df1.total[df2.index] += df2.total
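
A self-contained sketch of the one-liner (mine, not part of the original answer). Note that on recent pandas versions with copy-on-write enabled, the chained form above may not write back into df1, so the equivalent .loc spelling is safer:

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
    df1 = pd.DataFrame(df1, columns=['total'])
    df2 = pd.concat([df1.iloc[50:55], df1.iloc[2000:2005]])

    # .loc spelling of the same in-place update; the right-hand side
    # aligns on df2's index, so only those ten rows are touched.
    df1.loc[df2.index, 'total'] += df2.total
    print(df1.total.sum())  # ten rows were each incremented by 1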

As HYRY notes, looking only at the small subset of df2's index is the more efficient approach in this situation. You can do the same with the slightly more robust add function (which can fill NaN):

    df1.total[df2.index] = df1.total[df2.index].add(df2.total, fill_value=0)

Although the syntax here is not very DRY...
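
As an aside, here is a small sketch (my example, not from the original answer) of what that extra robustness buys you: plain + propagates NaN, while add with fill_value=0 treats missing values as zero:

    import numpy as np
    import pandas as pd

    s1 = pd.Series([1.0, np.nan], index=['a', 'b'])
    s2 = pd.Series([10.0, 20.0], index=['a', 'b'])

    print(s1 + s2)                   # a: 11.0, b: NaN  (NaN propagates)
    print(s1.add(s2, fill_value=0))  # a: 11.0, b: 20.0 (NaN treated as 0)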

Comparing some timeit results, we can see that add is not significantly slower, and both are an enormous improvement over the naive for loop:

    In [11]: %%timeit
       ....: result = df1.total.copy()
       ....: for row_index, row in df2.iterrows():
       ....:     result[row_index] += row
       ....:
    100 loops, best of 3: 17.9 ms per loop

    In [12]: %timeit df1.total[df2.index] = (df1.total[df2.index]).add(df2.total, fill_value=0)
    1000 loops, best of 3: 325 µs per loop

    In [13]: %timeit df1.total[df2.index] += df2.total
    1000 loops, best of 3: 283 µs per loop

It's an interesting question (and I may fill this in later) at what relative sizes this remains faster, but certainly in this extreme case it is a huge win...

The takeaway from this:

If you're writing a for loop (in Python) to speed things up, you're doing it wrong! :)


I think your second method may be faster in this particular case because you iterate through the smaller dataset (a small amount of work) and then access only a handful of elements of the larger dataset (an efficient operation, thanks to the pandas developers).

With the .add method, however, pandas has to look at both indexes in their entirety.

If df1 and df2 are the same length, the first method takes 54 ms, but the second takes > 2 minutes (on my machine; obviously, YMMV).
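
To see this effect yourself, here is a minimal sketch (mine, not from the original answer) of the equal-length setup; exact timings will vary, and the loop variant is slow enough at this size that you may want to shrink the data first:

    import time

    import numpy as np
    import pandas as pd

    df1 = pd.DataFrame(np.ones((1000, 5000)), dtype=int).stack()
    df1 = pd.DataFrame(df1, columns=['total'])
    df2 = df1.copy()  # same length as df1: .add has no small subset to exploit

    starttime = time.time()
    df1.total.add(df2.total, fill_value=0).sum()
    print("Vectorized add took %f seconds" % (time.time() - starttime))

    starttime = time.time()
    result = df1.total.copy()
    for idx, val in df2.total.items():  # label-by-label loop; very slow at 5,000,000 rows
        result.loc[idx] += val
    print("Manual loop took %f seconds" % (time.time() - starttime))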


Source: https://habr.com/ru/post/957632/

