How to combine two pandas frames in parallel (multithreading or multiprocessing)

Without parallel programming I can merge the left and right data on the key column using the code below, but both frames are very large and the merge is too slow. Is there any way I can parallelize this efficiently?

I have 64 cores, so I could use 63 of them to combine these two frames.
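Something like hash-partitioning on the key and merging the chunks in a pool is what I have in mind (an untested sketch; the chunk and worker counts are just placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def _merge_chunk(pair):
    left_chunk, right_chunk = pair
    return pd.merge(left_chunk, right_chunk, on='key')

def parallel_merge(left, right, n_chunks=8):
    # Rows with the same key hash into the same chunk, so the per-chunk
    # merges are independent and their results can simply be concatenated.
    left_bins = left['key'].apply(hash) % n_chunks
    right_bins = right['key'].apply(hash) % n_chunks
    pairs = [(left[left_bins == i], right[right_bins == i])
             for i in range(n_chunks)]
    # Threads rather than processes: pandas releases the GIL in many merge
    # code paths, and threads avoid pickling the chunks between workers.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(_merge_chunk, pairs))
    return pd.concat(parts, ignore_index=True)
```

But I don't know whether hand-rolling this is better than using an existing library.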

    left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})
    right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})
    result = pd.merge(left, right, on='key')

the output will be:

    left:
        A   B key
    0  A0  B0  K0
    1  A1  B1  K1
    2  A2  B2  K2
    3  A3  B3  K3

    right:
        C   D key
    0  C0  D0  K0
    1  C1  D1  K1
    2  C2  D2  K2
    3  C3  D3  K3

    result:
        A   B key   C   D
    0  A0  B0  K0  C0  D0
    1  A1  B1  K1  C1  D1
    2  A2  B2  K2  C2  D2
    3  A3  B3  K3  C3  D3

I want to do this in parallel for speed.

2 answers

I believe you can use dask and its merge function.

The docs say:

What definitely works?

Cleverly parallelizable operations (also fast):

Join on index: dd.merge(df1, df2, left_index=True, right_index=True)

Or:

Operations requiring a shuffle (slow-ish, unless on index):

Set index: df.set_index(df.x)

Join not on the index: dd.merge(df1, df2, on='name')

You can also check how to Create Dask DataFrames.

Example

    import pandas as pd

    left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})
    right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})
    result = pd.merge(left, right, on='key')
    print(result)
        A   B key   C   D
    0  A0  B0  K0  C0  D0
    1  A1  B1  K1  C1  D1
    2  A2  B2  K2  C2  D2
    3  A3  B3  K3  C3  D3

    import dask.dataframe as dd

    # construct dask objects from the pandas objects
    left1 = dd.from_pandas(left, npartitions=3)
    right1 = dd.from_pandas(right, npartitions=3)

    # merge on the key column (requires a shuffle, so rows come back unordered)
    print(dd.merge(left1, right1, on='key').compute())
        A   B key   C   D
    0  A3  B3  K3  C3  D3
    1  A1  B1  K1  C1  D1
    0  A2  B2  K2  C2  D2
    1  A0  B0  K0  C0  D0

    # first set the index on both sides, then merge on it
    print(dd.merge(left1.set_index('key'), right1.set_index('key'),
                   left_index=True, right_index=True).compute())
          A   B   C   D
    key
    K0   A0  B0  C0  D0
    K1   A1  B1  C1  D1
    K2   A2  B2  C2  D2
    K3   A3  B3  C3  D3

You can speed up the merge (by roughly 3× in this example) by setting the key column as the index and using join.

    left2 = left.set_index('key')
    right2 = right.set_index('key')

    In [46]: %timeit result2 = left2.join(right2)
    1000 loops, best of 3: 361 µs per loop

    In [47]: %timeit result = pd.merge(left, right, on='key')
    1000 loops, best of 3: 1.01 ms per loop
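For completeness, the indexed join returns the same data as the merge once the index is restored as a column (assuming the same toy frames as in the question):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

left2 = left.set_index('key')
right2 = right.set_index('key')

# join aligns on the index; reset_index turns 'key' back into a column
result2 = left2.join(right2).reset_index()
```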

Source: https://habr.com/ru/post/1244394/

