Why is np.where faster than pd.apply?

Sample code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

# [400000 rows x 2 columns]
df = pd.concat([df] * 100000).reset_index(drop=True)

In [129]: %timeit df['Grade'] = np.where(df['Spending'] > 100, 'A', 'B')
10 loops, best of 3: 21.6 ms per loop

In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)
1 loop, best of 3: 7.08 s per loop

Question taken from here: Stack Overflow

2 answers

I think np.where is faster because it operates directly on the underlying NumPy array; a pandas Series is built on top of such an array, so the comparison and the selection both run as vectorized C loops.

df.apply is slow because it runs a Python-level loop, calling the lambda once for every row.
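As a quick sanity check, a minimal sketch showing that the two approaches produce the same column and differ only in how the work is executed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Spending': [130, 22, 313, 46]})

# Vectorized: one C-level pass over the column's underlying NumPy array
vectorized = np.where(df['Spending'] > 100, 'A', 'B')

# apply with axis=1 calls the Python lambda once per row
looped = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)

# Same grades either way; only the speed differs
assert list(vectorized) == list(looped)
```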

Vectorized operations are the fastest, then Cython routines, and then apply.
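A reduced, self-contained version of the question's benchmark (smaller frame and fewer repeats so it finishes quickly; absolute timings will vary by machine, but the ordering holds):

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'Spending': np.random.randint(0, 300, size=20_000)})

# Vectorized path: one C-level pass over the whole column
t_vectorized = timeit.timeit(
    lambda: np.where(df['Spending'] > 100, 'A', 'B'), number=3)

# apply path: the Python lambda is invoked once per row
t_apply = timeit.timeit(
    lambda: df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1),
    number=3)

print(f"np.where: {t_vectorized:.4f}s  df.apply: {t_apply:.4f}s")
# The vectorized path is typically orders of magnitude faster
```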

See this answer for the best explanation, from pandas developer Jeff.


Just adding a visual approach to what has already been said.

Profile and cumulative time of df.apply: (figure: df.apply profile)

We see that the cumulative time is 13.8 s.

Profile and cumulative time of np.where: (figure: np.where profile)

Here the cumulative time is 5.44 ms, about 2500 times faster than df.apply.

The figures above were produced with the SnakeViz library.

SnakeViz displays profiles as sunburst diagrams, in which functions are represented as arcs. The root function is the circle in the middle, surrounded by the functions it calls, then the functions those functions call, and so on. The time spent inside a function is represented by the angular width of its arc: an arc that wraps around most of the circle is a function that accounts for most of its caller's time, while a thin arc is a function that uses almost none.
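For reference, a profile file that SnakeViz can open can be produced with the standard-library cProfile module; the filename and the profiled expression below are just an example, and viewing the result requires snakeviz to be installed:

```python
import cProfile
import pstats
import numpy as np
import pandas as pd

df = pd.DataFrame({'Spending': np.random.randint(0, 300, size=50_000)})

# Write profiling stats to a file that snakeviz can visualize with:
#   snakeviz np_where.prof
cProfile.runctx("np.where(df['Spending'] > 100, 'A', 'B')",
                globals(), locals(), 'np_where.prof')

# The same data can also be inspected in plain text with pstats
stats = pstats.Stats('np_where.prof')
stats.sort_stats('cumulative').print_stats(5)
```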


Source: https://habr.com/ru/post/1261337/

