Parallelize pandas apply

New to pandas, I already want to parallelize an apply operation. So far I have found Parallelize apply after pandas groupby, but that only works on grouped data frames.

My use case is different: I have a list of holidays, and for the current row/date I want to find the number of days before and after it to the nearest holiday.

This is the function I call through apply:

    import numpy as np

    def get_nearest_holiday(x, pivot):
        # find the holiday closest to the pivot date
        nearestHoliday = min(x, key=lambda d: abs(d - pivot))
        difference = abs(nearestHoliday - pivot)
        # convert the timedelta to a float number of days
        return difference / np.timedelta64(1, 'D')
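For context, a minimal runnable sketch of how this is being called (the holiday list, dates, and column names here are made up for illustration):

```python
import numpy as np
import pandas as pd

def get_nearest_holiday(x, pivot):
    # find the holiday closest to the pivot date
    nearestHoliday = min(x, key=lambda d: abs(d - pivot))
    difference = abs(nearestHoliday - pivot)
    return difference / np.timedelta64(1, 'D')

holidays = pd.to_datetime(['2016-01-01', '2016-07-04', '2016-12-26'])
df = pd.DataFrame({'myDates': pd.to_datetime(['2016-01-03', '2016-07-01'])})

# the row-by-row apply that the question wants to speed up
df['distToHoliday'] = df['myDates'].apply(lambda d: get_nearest_holiday(holidays, d))
print(df['distToHoliday'].tolist())  # [2.0, 3.0]
```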

How can I speed it up?

Edit

I experimented a bit with Python's multiprocessing pools, but neither the code nor the results I got were any good.
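A chunked pool version of that idea might look like the sketch below (the helper names are made up; a ThreadPoolExecutor is used so the snippet runs as-is, though because of the GIL a real CPU-bound speedup would need a ProcessPoolExecutor instead):

```python
import numpy as np
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

def get_nearest_holiday(x, pivot):
    nearestHoliday = min(x, key=lambda d: abs(d - pivot))
    return abs(nearestHoliday - pivot) / np.timedelta64(1, 'D')

holidays = pd.to_datetime(['2016-01-01', '2016-07-04', '2016-12-26'])
dates = pd.Series(pd.to_datetime(['2016-01-03', '2016-07-01', '2016-12-30']))

def process_chunk(chunk):
    # apply the per-row function to one slice of the Series
    return chunk.apply(lambda d: get_nearest_holiday(holidays, d))

# split the Series into chunks and map the chunks over a worker pool
chunks = np.array_split(dates, 2)
with ThreadPoolExecutor(max_workers=2) as pool:
    result = pd.concat(pool.map(process_chunk, chunks))

print(result.tolist())  # [2.0, 3.0, 4.0]
```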

2 answers

I think going down the concurrency path is probably overcomplicating this. I haven't tried this approach on a large sample, so your mileage may vary, but it should give you an idea...

Let's start with some dates ...

    import pandas as pd

    dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])

We will use some holiday data from pandas.tseries.holiday, noting that essentially we just want a DatetimeIndex...

    from pandas.tseries.holiday import USFederalHolidayCalendar

    holiday_calendar = USFederalHolidayCalendar()
    holidays = holiday_calendar.holidays('2016-01-01')

This gives us:

    DatetimeIndex(['2016-01-01', '2016-01-18', '2016-02-15', '2016-05-30',
                   '2016-07-04', '2016-09-05', '2016-10-10', '2016-11-11',
                   '2016-11-24', '2016-12-26',
                   ...
                   '2030-01-01', '2030-01-21', '2030-02-18', '2030-05-27',
                   '2030-07-04', '2030-09-02', '2030-10-14', '2030-11-11',
                   '2030-11-28', '2030-12-25'],
                  dtype='datetime64[ns]', length=150, freq=None)

Now we find, for each source date, the index of the next holiday using searchsorted:

    indices = holidays.searchsorted(dates)
    # array([1, 6, 9, 3])
    next_nearest = holidays[indices]
    # DatetimeIndex(['2016-01-18', '2016-10-10', '2016-12-26', '2016-05-30'],
    #               dtype='datetime64[ns]', freq=None)

Then take the difference between the two:

    next_nearest_diff = pd.to_timedelta(next_nearest.values - dates.values).days
    # array([15, 31, 14, 88])

You need to be careful with the indices so that you don't wrap around the ends of the DatetimeIndex, and for the previous holiday do the same calculation with indices - 1, but it should (I hope) act as a relatively good base.
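Putting those steps together, an edge-safe version might look like the following sketch (clipping the indices into range is my addition, not part of the answer above):

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

dates = pd.to_datetime(['2016-01-03', '2016-09-09', '2016-12-12', '2016-03-03'])
holidays = USFederalHolidayCalendar().holidays('2016-01-01')

# index of the first holiday on or after each date
idx = holidays.searchsorted(dates)

# clip so that both idx and idx - 1 stay within bounds
next_idx = np.clip(idx, 0, len(holidays) - 1)
prev_idx = np.clip(idx - 1, 0, len(holidays) - 1)

days_after = pd.to_timedelta(holidays[next_idx].values - dates.values).days
days_before = pd.to_timedelta(dates.values - holidays[prev_idx].values).days

print(days_after.tolist())   # [15, 31, 14, 88]
print(days_before.tolist())  # [2, 4, 18, 17]
```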


For a parallel approach, here is an answer based on Parallelize apply after pandas groupby:

    from joblib import Parallel, delayed
    import multiprocessing

    def get_nearest_dateParallel(df):
        df['daysBeforeHoliday'] = df.myDates.apply(
            lambda x: get_nearest_date(holidays.day[holidays.day < x], x))
        df['daysAfterHoliday'] = df.myDates.apply(
            lambda x: get_nearest_date(holidays.day[holidays.day > x], x))
        return df

    def applyParallel(dfGrouped, func):
        retLst = Parallel(n_jobs=multiprocessing.cpu_count())(
            delayed(func)(group) for name, group in dfGrouped)
        return pd.concat(retLst)

    print('parallel version: ')
    # 4 min 30 seconds
    %time result = applyParallel(datesFrame.groupby(datesFrame.index), get_nearest_dateParallel)

but I prefer @NinjaPuppy's approach because it does not require an O(n * number_of_holidays) scan.


Source: https://habr.com/ru/post/1012600/
