How to combine two pandas frames in parallel (multithreading or multiprocessing)

Without parallel programming I can merge the left and right data on the key column using the code below, but both frames are very large and the merge is too slow. Is there any way I can parallelize this efficiently?

I have 64 cores, so I could use 63 of them to combine these two frames.
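Something like hash-partitioning on the key and merging the chunks in a pool is what I have in mind (an untested sketch; the chunk and worker counts are just placeholders):

```python
from concurrent.futures import ThreadPoolExecutor

import pandas as pd

def _merge_chunk(pair):
    left_chunk, right_chunk = pair
    return pd.merge(left_chunk, right_chunk, on='key')

def parallel_merge(left, right, n_chunks=8):
    # Rows with the same key hash into the same chunk, so the per-chunk
    # merges are independent and their results can simply be concatenated.
    left_bins = left['key'].apply(hash) % n_chunks
    right_bins = right['key'].apply(hash) % n_chunks
    pairs = [(left[left_bins == i], right[right_bins == i])
             for i in range(n_chunks)]
    # Threads rather than processes: pandas releases the GIL in many merge
    # code paths, and threads avoid pickling the chunks between workers.
    with ThreadPoolExecutor(max_workers=n_chunks) as pool:
        parts = list(pool.map(_merge_chunk, pairs))
    return pd.concat(parts, ignore_index=True)
```

But I don't know whether hand-rolling this is better than using an existing library.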

    left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})
    right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})
    result = pd.merge(left, right, on='key')

the output will be:

    left:
        A   B key
    0  A0  B0  K0
    1  A1  B1  K1
    2  A2  B2  K2
    3  A3  B3  K3

    right:
        C   D key
    0  C0  D0  K0
    1  C1  D1  K1
    2  C2  D2  K2
    3  C3  D3  K3

    result:
        A   B key   C   D
    0  A0  B0  K0  C0  D0
    1  A1  B1  K1  C1  D1
    2  A2  B2  K2  C2  D2
    3  A3  B3  K3  C3  D3

I want to do this in parallel for speed.

2 answers

I believe you can use dask and its merge function.

The docs say:

What definitely works?

Cleverly parallelizable operations (also fast):

Join on index: dd.merge(df1, df2, left_index=True, right_index=True)

Or:

Operations requiring a shuffle (slow-ish, unless on index):

Set index: df.set_index(df.x)

Join not on the index: dd.merge(df1, df2, on='name')

You can also check how to Create Dask DataFrames.

Example

    import pandas as pd

    left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                         'A': ['A0', 'A1', 'A2', 'A3'],
                         'B': ['B0', 'B1', 'B2', 'B3']})
    right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                          'C': ['C0', 'C1', 'C2', 'C3'],
                          'D': ['D0', 'D1', 'D2', 'D3']})
    result = pd.merge(left, right, on='key')
    print(result)
        A   B key   C   D
    0  A0  B0  K0  C0  D0
    1  A1  B1  K1  C1  D1
    2  A2  B2  K2  C2  D2
    3  A3  B3  K3  C3  D3

    import dask.dataframe as dd

    # construct dask objects from the pandas objects
    left1 = dd.from_pandas(left, npartitions=3)
    right1 = dd.from_pandas(right, npartitions=3)

    # merge on the key column (requires a shuffle, so rows come back unordered)
    print(dd.merge(left1, right1, on='key').compute())
        A   B key   C   D
    0  A3  B3  K3  C3  D3
    1  A1  B1  K1  C1  D1
    0  A2  B2  K2  C2  D2
    1  A0  B0  K0  C0  D0

    # first set the index on both sides, then merge on it
    print(dd.merge(left1.set_index('key'), right1.set_index('key'),
                   left_index=True, right_index=True).compute())
          A   B   C   D
    key
    K0   A0  B0  C0  D0
    K1   A1  B1  C1  D1
    K2   A2  B2  C2  D2
    K3   A3  B3  C3  D3

You can speed up the merge (by roughly 3× in this example) by setting the key column as the index and using join.

    left2 = left.set_index('key')
    right2 = right.set_index('key')

    In [46]: %timeit result2 = left2.join(right2)
    1000 loops, best of 3: 361 µs per loop

    In [47]: %timeit result = pd.merge(left, right, on='key')
    1000 loops, best of 3: 1.01 ms per loop
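For completeness, the indexed join returns the same data as the merge once the index is restored as a column (assuming the same toy frames as in the question):

```python
import pandas as pd

left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                     'A': ['A0', 'A1', 'A2', 'A3'],
                     'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
                      'C': ['C0', 'C1', 'C2', 'C3'],
                      'D': ['D0', 'D1', 'D2', 'D3']})

left2 = left.set_index('key')
right2 = right.set_index('key')

# join aligns on the index; reset_index turns 'key' back into a column
result2 = left2.join(right2).reset_index()
```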

Source: https://habr.com/ru/post/1244394/

