Why is np.where faster than pd.apply?

Sample code:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Customer': ['Bob', 'Ken', 'Steve', 'Joe'],
                   'Spending': [130, 22, 313, 46]})

# [400000 rows x 2 columns]
df = pd.concat([df] * 100000).reset_index(drop=True)

In [129]: %timeit df['Grade'] = np.where(df['Spending'] > 100, 'A', 'B')
10 loops, best of 3: 21.6 ms per loop

In [130]: %timeit df['grade'] = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)
1 loop, best of 3: 7.08 s per loop

Question taken from here: Stack Overflow

2 answers

I think np.where is faster because it operates directly on the underlying NumPy array; a pandas Series is built on top of such an array, so the comparison and the selection both run as vectorized C loops.

df.apply is slow because it runs a Python-level loop, calling the lambda once for every row.
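As a quick sanity check, a minimal sketch showing that the two approaches produce the same column and differ only in how the work is executed:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Spending': [130, 22, 313, 46]})

# Vectorized: one C-level pass over the column's underlying NumPy array
vectorized = np.where(df['Spending'] > 100, 'A', 'B')

# apply with axis=1 calls the Python lambda once per row
looped = df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1)

# Same grades either way; only the speed differs
assert list(vectorized) == list(looped)
```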

Vectorized operations are the fastest, then Cython routines, and then apply.
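A reduced, self-contained version of the question's benchmark (smaller frame and fewer repeats so it finishes quickly; absolute timings will vary by machine, but the ordering holds):

```python
import timeit
import numpy as np
import pandas as pd

df = pd.DataFrame({'Spending': np.random.randint(0, 300, size=20_000)})

# Vectorized path: one C-level pass over the whole column
t_vectorized = timeit.timeit(
    lambda: np.where(df['Spending'] > 100, 'A', 'B'), number=3)

# apply path: the Python lambda is invoked once per row
t_apply = timeit.timeit(
    lambda: df.apply(lambda row: 'A' if row['Spending'] > 100 else 'B', axis=1),
    number=3)

print(f"np.where: {t_vectorized:.4f}s  df.apply: {t_apply:.4f}s")
# The vectorized path is typically orders of magnitude faster
```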

See this answer for the best explanation, from pandas developer Jeff.


Just adding a visual approach to what has already been said.

Profile and cumulative time of df.apply: (figure: df.apply profile)

We see that the cumulative time is 13.8 s.

Profile and cumulative time of np.where: (figure: np.where profile)

Here the cumulative time is 5.44 ms, about 2500 times faster than df.apply.

The figures above were produced with the SnakeViz library.

SnakeViz displays profiles as sunburst diagrams, in which functions are represented as arcs. The root function is the circle in the middle, surrounded by the functions it calls, then the functions those functions call, and so on. The time spent inside a function is represented by the angular width of its arc: an arc that wraps around most of the circle is a function that accounts for most of its caller's time, while a thin arc is a function that uses almost none.
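For reference, a profile file that SnakeViz can open can be produced with the standard-library cProfile module; the filename and the profiled expression below are just an example, and viewing the result requires snakeviz to be installed:

```python
import cProfile
import pstats
import numpy as np
import pandas as pd

df = pd.DataFrame({'Spending': np.random.randint(0, 300, size=50_000)})

# Write profiling stats to a file that snakeviz can visualize with:
#   snakeviz np_where.prof
cProfile.runctx("np.where(df['Spending'] > 100, 'A', 'B')",
                globals(), locals(), 'np_where.prof')

# The same data can also be inspected in plain text with pstats
stats = pstats.Stats('np_where.prof')
stats.sort_stats('cumulative').print_stats(5)
```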


Source: https://habr.com/ru/post/1261337/

