I'm doing multiprocessing on a pandas DataFrame: I split it into several smaller DataFrames, store them in a list, and pass each one to a worker function with Pool.map(). My input file is about 300 MB, so each of the smaller DataFrames is roughly 75 MB. Yet total memory consumption jumps by about 7 GB, with each worker process using roughly 2 GB. Why is this happening?
import resource
import pandas as pd
from multiprocessing import Pool

def main():
    my_df = pd.read_table("my_file.txt", sep="\t")
    my_df = my_df.groupby('someCol')
    my_df_list = []
    for colID, colData in my_df:
        my_df_list.append(colData)

    p = Pool(3)
    result = p.map(process_df, my_df_list)
    p.close()
    p.join()

    print('Global maximum memory usage: %.2f (mb)' % current_mem_usage())
    result_merged = pd.concat(result)

def process_df(my_df):
    my_new_df = my_df  # placeholder: the real per-group work on "my_df" happens here
    print('\tWorker maximum memory usage: %.2f (mb)' % current_mem_usage())
    del my_df
    return my_new_df

def current_mem_usage():
    # peak resident set size of the current process, in MB
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.

if __name__ == '__main__':
    main()
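(Side note on the measurement itself: as far as I know, ru_maxrss is reported in kilobytes on Linux but in bytes on macOS, so a platform-aware variant of the helper might look like the sketch below; the sys.platform check is my own addition.)

import resource
import sys

def current_mem_usage():
    # peak resident set size: kilobytes on Linux, bytes on macOS
    rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    return rss / (1024. * 1024. if sys.platform == 'darwin' else 1024.)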
The results themselves are correct, but the memory consumption seems excessive given that each chunk is only about 75 MB. Why is that? Is this a memory leak? What are the possible remedies?
Memory usage output:
Worker maximum memory usage: 2182.84 (mb)
Worker maximum memory usage: 2182.84 (mb)
Worker maximum memory usage: 2837.69 (mb)
Worker maximum memory usage: 2849.84 (mb)
Global maximum memory usage: 3106.00 (mb)
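One remedy I have been sketching (untested; it assumes the default fork start method on Linux, and FULL_DF / process_group are placeholder names of mine) is to pass only the group keys to the workers, let each worker slice the DataFrame it inherits from the parent via copy-on-write, and recycle workers with maxtasksperchild=1 so their memory is released after each task:

import pandas as pd
from multiprocessing import Pool

FULL_DF = None  # loaded in main() before the Pool is created

def process_group(col_id):
    # The worker slices the DataFrame inherited from the parent via fork,
    # so only the small key is pickled and sent to the child.
    group = FULL_DF[FULL_DF['someCol'] == col_id]
    # placeholder for the real per-group processing
    return group

def main():
    global FULL_DF
    FULL_DF = pd.read_table("my_file.txt", sep="\t")
    keys = FULL_DF['someCol'].unique()
    # maxtasksperchild=1 replaces each worker after one task, freeing its memory
    with Pool(3, maxtasksperchild=1) as p:
        result = p.map(process_group, keys)
    result_merged = pd.concat(result)

if __name__ == '__main__':
    main()

Would something along these lines actually reduce the peak usage, or is the copying unavoidable?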