Why does using multiprocessing with pandas lead to such a dramatic speedup?

Suppose I have a pandas DataFrame and a function I would like to apply to each row. I can call df.apply(apply_fn, axis=1) , which should take time linear in the size of df . Or I can split df into pieces, use pool.map to call my function on each piece, and then concatenate the results.

I expected the speedup factor from pool.map to be roughly equal to the number of processes in the pool (new_execution_time = original_execution_time / N if N processors are used, assuming zero overhead).

Instead, in the toy example below, the time drops to about 2% of the original (0.005272 / 0.230757) when using 4 processors. I was expecting 25% at best. What is going on, and what am I misunderstanding?

    import numpy as np
    from multiprocessing import Pool
    import pandas as pd
    import pdb
    import time

    n = 1000
    variables = {"hello": np.arange(n), "there": np.random.randn(n)}
    df = pd.DataFrame(variables)

    def apply_fn(series):
        return pd.Series({"col_5": 5, "col_88": 88,
                          "sum_hello_there": series["hello"] + series["there"]})

    def call_apply_fn(df):
        return df.apply(apply_fn, axis=1)

    n_processes = 4  # My machine has 4 CPUs
    pool = Pool(processes=n_processes)

    t0 = time.process_time()
    new_df = df.apply(apply_fn, axis=1)
    t1 = time.process_time()
    df_split = np.array_split(df, n_processes)
    pool_results = pool.map(call_apply_fn, df_split)
    new_df2 = pd.concat(pool_results)
    t2 = time.process_time()
    new_df3 = df.apply(apply_fn, axis=1)  # Try df.apply a second time
    t3 = time.process_time()

    print("identical results: %s" % np.all(np.isclose(new_df, new_df2)))  # True
    print("t1 - t0 = %f" % (t1 - t0))  # I got 0.230757
    print("t2 - t1 = %f" % (t2 - t1))  # I got 0.005272
    print("t3 - t2 = %f" % (t3 - t2))  # I got 0.229413

I saved the code above and ran it with python3 my_filename.py.

PS I understand that in this toy example new_df can be created in a much simpler way, without using apply. I am interested in running similar code with a more complex apply_fn that doesn't just add columns.

1 answer

Edit: my previous answer was actually incorrect.

time.process_time() (see the Python docs) measures CPU time of the current process only, and does not include time spent sleeping. Time spent in child processes is therefore not counted at all, which is why the pool.map branch appears to take almost no time.
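To see this effect directly, here is a minimal sketch (not from the original post; the function busy and the size of the work are arbitrary choices) that pushes CPU work into a pool and measures it both ways:

    # Minimal illustration: process_time() only sees the parent's CPU time,
    # so work done inside pool workers is invisible to it.
    import time
    from multiprocessing import Pool

    def busy(_):
        # Burn CPU inside a worker process.
        return sum(i * i for i in range(10_000_000))

    if __name__ == "__main__":
        with Pool(4) as pool:
            t0_cpu, t0_wall = time.process_time(), time.time()
            pool.map(busy, range(4))
            t1_cpu, t1_wall = time.process_time(), time.time()
        print("parent CPU time: %.4f s" % (t1_cpu - t0_cpu))   # near zero
        print("wall-clock time: %.4f s" % (t1_wall - t0_wall))  # the real cost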

I reran your code with time.time() , which measures wall-clock time (it showed no speedup), and with the more reliable timeit.timeit (about a 50% speedup). I have 4 cores.
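For reference, a sketch of how such a comparison might look with timeit. This is not from the original post: it assumes it is appended to the question's script, reusing df , pool , apply_fn , call_apply_fn and n_processes defined there:

    # Hypothetical timing harness: append after the question's definitions.
    import timeit

    serial = timeit.timeit(lambda: df.apply(apply_fn, axis=1), number=10)
    parallel = timeit.timeit(
        lambda: pd.concat(pool.map(call_apply_fn, np.array_split(df, n_processes))),
        number=10)
    print("serial:   %.4f s" % serial)
    print("parallel: %.4f s" % parallel)

timeit uses time.perf_counter by default, which is wall-clock based, so work done in the child processes is included in both measurements.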



