A more Pythonic / pandas-idiomatic approach to looping over a pandas Series

Question

This is most likely something very basic, but I can't figure it out. Suppose I have a series like this:

s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

How can I perform operations on sub-series of this series without resorting to a for loop?

Suppose, for example, that I want to turn it into a new series with four elements. The first element of the new series is the sum of the first three elements of the original series (1, 1, 1), the second is the sum of the next three (2, 2, 2), and so on:

 s2 = pd.Series([3, 6, 9, 12])

How can I do this?

+6

python loops numpy pandas

rdv Jan 05 '17 at 12:44

4 answers

Here's a NumPy approach using np.bincount, which handles any number of elements, including lengths that are not a multiple of 3 -

 pd.Series(np.bincount(np.arange(s1.size)//3, s1))

Sample run -

 In [42]: s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 9, 5])
 In [43]: pd.Series(np.bincount(np.arange(s1.size)//3, s1))
 Out[43]:
 0     3.0
 1     6.0
 2     9.0
 3    12.0
 4    14.0
 dtype: float64
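To make the mechanics of the one-liner more visible, here is a minimal sketch that splits it into two steps. The names group_ids and chunk_sums are only illustrative and do not appear in the answer.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 9, 5])

# Integer-dividing each position by the chunk size yields a group id per element:
# [0, 0, 0, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4]
group_ids = np.arange(s1.size) // 3

# With weights, bincount adds each value into the bin named by its group id,
# so a short final chunk (here 9 and 5) simply gets its own bin.
chunk_sums = pd.Series(np.bincount(group_ids, weights=s1.to_numpy()))
print(chunk_sums)
```

Because bincount accumulates weights as floats, the result is float64 even for integer input, which matches the sample run above.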

If we are really after performance, and the length of the series is divisible by the window length, we can get the underlying array with s1.values, reshape it, and finally use np.einsum for the summation, like so:

 pd.Series(np.einsum('ij->i',s1.values.reshape(-1,3)))
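The reshape step only works when the length is an exact multiple of the window. A small sketch of the same idea with that condition made explicit follows; the window variable and the explicit check are additions for illustration, not part of the answer.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
window = 3  # illustrative name for the chunk length

# reshape(-1, window) raises if the length is not a multiple of the window,
# so fail early with a clearer message.
if s1.size % window != 0:
    raise ValueError("series length must be divisible by the window length")

# Each row of the 2-D array holds one chunk of `window` consecutive values.
chunks = s1.values.reshape(-1, window)

# 'ij->i' keeps the row axis i and sums over the column axis j.
row_sums = pd.Series(np.einsum('ij->i', chunks))
print(row_sums)
```

For this input the einsum call gives the same values as chunks.sum(axis=1); which of the two is faster may depend on the NumPy version and the data size.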

Timings with the same dataset as in @Nickil Maveli's post -

 In [140]: s = pd.Series(np.repeat(np.arange(10**5), 3))

 # @Nickil Maveli soln
 In [141]: %timeit pd.Series(np.add.reduceat(s.values, np.arange(0, s.shape[0], 3)))
 100 loops, best of 3: 2.07 ms per loop

 # Using views+sum
 In [142]: %timeit pd.Series(s.values.reshape(-1,3).sum(1))
 100 loops, best of 3: 2.03 ms per loop

 # Using views+einsum
 In [143]: %timeit pd.Series(np.einsum('ij->i',s.values.reshape(-1,3)))
 1000 loops, best of 3: 1.04 ms per loop

+5

Divakar Jan 05 '17 at 13:06

You could also use np.add.reduceat, specifying slice boundaries at every third element and computing the sum of each slice:

 >>> pd.Series(np.add.reduceat(s1.values, np.arange(0, s1.shape[0], 3)))
 0     3
 1     6
 2     9
 3    12
 dtype: int64
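As a rough illustration of what the index array does, the sketch below uses a 14-element series whose length is not a multiple of 3. The names starts and chunk_sums are illustrative only.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4, 9, 5])

# Start position of each chunk: [0, 3, 6, 9, 12]
starts = np.arange(0, s1.shape[0], 3)

# For increasing indices, reduceat sums values[starts[i]:starts[i+1]];
# the last slice runs from starts[-1] to the end of the array.
chunk_sums = pd.Series(np.add.reduceat(s1.values, starts))
print(chunk_sums)
```

Unlike the bincount approach, the integer dtype of the input is preserved here, and the trailing partial chunk (9, 5) is summed as its own slice.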

Timings:

 arr = np.repeat(np.arange(10**5), 3)
 s = pd.Series(arr)
 s.shape
 (300000,)

 # @IanS soln
 %timeit s.rolling(3).sum()[2::3]
 100 loops, best of 3: 15.6 ms per loop

 # @Divakar soln
 %timeit pd.Series(np.bincount(np.arange(s.size)//3, s))
 100 loops, best of 3: 5.44 ms per loop

 # @Nikolas Rieble soln
 %timeit pd.Series(np.sum(np.array(s).reshape(len(s)/3,3), axis = 1))
 100 loops, best of 3: 2.17 ms per loop

 # @Nikolas Rieble modified soln
 %timeit pd.Series(np.sum(np.array(s).reshape(-1, 3), axis=1))
 100 loops, best of 3: 2.15 ms per loop

 # @Divakar modified soln
 %timeit pd.Series(s.values.reshape(-1,3).sum(1))
 1000 loops, best of 3: 1.62 ms per loop

 # Proposed solution in post
 %timeit pd.Series(np.add.reduceat(s.values, np.arange(0, s.shape[0], 3)))
 1000 loops, best of 3: 1.45 ms per loop

+5

Nickil Maveli Jan 05 '17 at 13:10

This computes the rolling sum:

 s1.rolling(3).sum()

You just need to select every third element:

 s1.rolling(3).sum()[2::3]

Output:

 2      3.0
 5      6.0
 8      9.0
 11    12.0
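Note that the result keeps the original index labels (2, 5, 8, 11) and is float-valued. If the goal is a series that looks exactly like s2 from the question, one possible follow-up, sketched here as an assumption about what is wanted, is to reset the index and cast back to integers. The sketch uses .iloc to make the positional slicing explicit.

```python
import pandas as pd

s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])

# rolling(3).sum() is NaN for the first two positions, then a 3-wide sum.
rolled = s1.rolling(3).sum()

# Positions 2, 5, 8, 11 are where each non-overlapping chunk ends.
picked = rolled.iloc[2::3]

# Drop the original labels and cast back to integers to match s2.
s2 = picked.reset_index(drop=True).astype(int)
print(s2)
```

This approach computes a sum for every position and then discards two thirds of them, which is consistent with it being the slowest entry in the timings in this thread.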

+2

IanS Jan 05 '17 at 12:51

Nikolas Rieble · Accepted Answer · 2017-01-05T12:51:35+0000

You can reshape the series s1 using numpy and then sum over the rows, for example:

 np.sum(np.array(s1).reshape(len(s1)//3, 3), axis=1)

which results in

 array([ 3, 6, 9, 12], dtype=int64)

EDIT: As MSeifert mentioned in his comment, you can also let numpy compute the length:

 np.sum(np.array(s1).reshape(-1, 3), axis=1)
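As a quick sanity check, the sketch below runs the approaches from the answers in this thread on the question's series and compares each result with the expected s2 values. The results dictionary and its keys are illustrative names only.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4])
expected = [3, 6, 9, 12]

results = {
    "reshape": np.sum(np.array(s1).reshape(-1, 3), axis=1),
    "bincount": np.bincount(np.arange(s1.size) // 3, weights=s1.to_numpy()),
    "reduceat": np.add.reduceat(s1.values, np.arange(0, s1.shape[0], 3)),
    "rolling": s1.rolling(3).sum().iloc[2::3].to_numpy(),
}

for name, values in results.items():
    # bincount and rolling return floats, so compare by value rather than dtype.
    print(name, np.array_equal(values, expected))
```

All four agree on this input. They differ when the length is not a multiple of 3: the reshape variant raises an error and rolling drops the trailing partial chunk, while bincount and reduceat sum it as a shorter final group.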
