Suppose I have a pandas DataFrame and a function that I would like to apply to each row. I can call `df.apply(apply_fn, axis=1)`, which should take time linear in the size of `df`. Alternatively, I can split `df` into chunks, use `pool.map` to call my function on each chunk, and then concatenate the results.
I expected the speedup from using `pool.map` to be roughly equal to the number of processes in the pool (`new_execution_time = original_execution_time / N` when `N` processors are used, assuming zero overhead).
Instead, in the toy example below, the time drops to about 2% of the original (0.005272 / 0.230757) when using 4 processors. I was expecting 25% at best. What is going on, and what am I misunderstanding?
```python
import numpy as np
from multiprocessing import Pool
import pandas as pd
import pdb
import time

n = 1000
variables = {"hello": np.arange(n), "there": np.random.randn(n)}
df = pd.DataFrame(variables)

def apply_fn(series):
    return pd.Series({"col_5": 5, "col_88": 88,
                      "sum_hello_there": series["hello"] + series["there"]})

def call_apply_fn(df):
    return df.apply(apply_fn, axis=1)

n_processes = 4
```
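The snippet above stops before the benchmark that produced the numbers quoted earlier. For reference, here is a minimal sketch of what such a harness could look like, assuming the split is done with `np.array_split` and the results are recombined with `pd.concat`; the variable names and the use of `time.time()` are my assumptions, not taken from the original post:

```python
# Hypothetical benchmark (a reconstruction, not the original post's code):
# compares plain df.apply against splitting df across a process pool.
if __name__ == "__main__":
    # Single-process baseline: apply the function row by row.
    t0 = time.time()
    new_df = df.apply(apply_fn, axis=1)
    t1 = time.time()

    # Multiprocess version: split df into n_processes chunks, apply the
    # function to each chunk in a worker process, then concatenate.
    with Pool(processes=n_processes) as pool:
        chunks = np.array_split(df, n_processes)
        new_df_multi = pd.concat(pool.map(call_apply_fn, chunks))
    t2 = time.time()

    print("df.apply:  %f s" % (t1 - t0))
    print("pool.map:  %f s" % (t2 - t1))
```

One detail worth checking when reproducing this is the choice of clock: `time.time()` measures wall-clock time, whereas `time.process_time()` counts only the CPU time of the calling process, so work done inside pool workers would not show up in it.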
I saved the code above and ran it using `python3 my_filename.py`.
PS I understand that in this toy example `new_df` can be created in a much simpler way, without using `apply`. I am interested in applying similar code with a more complex `apply_fn` that doesn't just add columns.