Computing multiple pandas aggregations at the same time

Is it possible to run multiple groupby computations on a pandas DataFrame at the same time and collect their results? Concretely, I would like to compute the following, get each result, and finally merge them into one DataFrame.

 df_a = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["height"]))
 df_b = df.groupby(["state", "person"]).apply(lambda x: np.mean(x["weight"]))
 df_c = df.groupby(["state", "person"]).apply(lambda x: x["number"].sum())

And then,

 df_final = merge(df_a, df_b) # omitting the irrelevant part 

However, as far as I know, the functions in multiprocessing do not meet my needs here. They seem aimed either at running several functions at the same time that do not return their internal, local variables and instead just print some result inside the function (for example, the oft-used is_prime example), or at running a single function with different sets of arguments (e.g., multiprocessing's map function), if I understand correctly (I'm not sure I do, so correct me if I'm wrong!).

What I would like is simply to run these three computations (and in practice more) at the same time and combine the results once all of them have completed successfully. I imagine something like Go's goroutines and channels: spawn each function, run them concurrently, wait for them all to finish, and finally combine the results.

So how can this be written in Python? I have read the multiprocessing, threading, and concurrent.futures documentation, but it is all too opaque for me; I can't even tell whether these libraries apply to my problem...
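For illustration, here is one way the pattern the question describes can be sketched with concurrent.futures. The sample data and values below are made up; only the column names come from the question. A ThreadPoolExecutor is used because each submitted callable returns its result, which is exactly the "run, wait, collect" shape the asker wants:

```python
import pandas as pd
from concurrent.futures import ThreadPoolExecutor

# made-up sample data; only the column names come from the question
df = pd.DataFrame({
    "state":  ["NY", "NY", "CA", "CA"],
    "person": ["ann", "ann", "bob", "bob"],
    "height": [170, 180, 160, 150],
    "weight": [70, 80, 60, 50],
    "number": [1, 2, 3, 4],
})

grouped = df.groupby(["state", "person"])

# each callable returns its result, so future.result() hands it back to us
tasks = [
    lambda: grouped["height"].mean(),
    lambda: grouped["weight"].mean(),
    lambda: grouped["number"].sum(),
]

with ThreadPoolExecutor() as executor:
    futures = [executor.submit(task) for task in tasks]
    df_a, df_b, df_c = [future.result() for future in futures]

# combine the three per-group results into one DataFrame on the shared index
df_final = pd.concat([df_a, df_b, df_c], axis=1)
```

With ProcessPoolExecutor the structure is the same, but the callables must be picklable (module-level functions rather than lambdas); that variant matters when the work is CPU-bound and needs true parallelism.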

(I simplified the code for brevity; the actual code is more complex, so please don't answer "you can write it in one line, without concurrency" or the like.)

Thanks.

1 answer

9 months later, and this is still one of the top search results for multiprocessing with pandas. I hope you have found an answer by now, but if not, here is one that seems to work, and hopefully it will help others looking at this question.

 import pandas as pd
 import numpy as np

 # sample data
 df = pd.DataFrame([[1,2,3,1,2,3,1,2,3,1],
                    [2,2,2,2,2,2,2,2,2,2],
                    [1,3,5,7,9,2,4,6,8,0],
                    [2,4,6,8,0,1,3,5,7,9]]).transpose()
 df.columns = ['a', 'b', 'c', 'd']

 df
 #    a  b  c  d
 # 0  1  2  1  2
 # 1  2  2  3  4
 # 2  3  2  5  6
 # 3  1  2  7  8
 # 4  2  2  9  0
 # 5  3  2  2  1
 # 6  1  2  4  3
 # 7  2  2  6  5
 # 8  3  2  8  7
 # 9  1  2  0  9

 # this one function does the three things you used in your question; obviously
 # you could add more functions, or different ones for different groupby keys
 def f(x):
     return [np.mean(x[1]['c']), np.mean(x[1]['d']), x[1]['d'].sum()]

 # set up a pool with 4 CPUs
 from multiprocessing import Pool
 pool = Pool(4)

 # run the statistics you wanted on each group
 group_df = pd.DataFrame(pool.map(f, df.groupby(['a', 'b'])))

 group_df
 #    0         1   2
 # 0  3  5.500000  22
 # 1  6  3.000000   9
 # 2  5  4.666667  14

 group_df['keys'] = df.groupby(['a', 'b']).groups.keys()

 group_df
 #    0         1   2    keys
 # 0  3  5.500000  22  (1, 2)
 # 1  6  3.000000   9  (3, 2)
 # 2  5  4.666667  14  (2, 2)

At least I hope this helps someone who is looking at this material in the future.


Source: https://habr.com/ru/post/957639/
