I just started using dask, and I'm still confused about how to do simple pandas tasks with multiple threads or using a cluster.
Take pandas.merge() with dask data frames.
import dask.dataframe as dd
df1 = dd.read_csv("file1.csv")
df2 = dd.read_csv("file2.csv")
df3 = dd.merge(df1, df2)
Now let's say I run this on my laptop with 4 cores. How do I assign 4 threads to this task?
It seems the right way to do this is:
import dask
dask.set_options(get=dask.threaded.get)
df3 = dd.merge(df1, df2).compute()
Will that use as many threads as there are cores with shared memory on my laptop (i.e. 4)? How do I set the number of threads?
Say I'm on a machine with 100 cores. How can I submit this the same way I would submit jobs to a cluster using qsub? (Similar to running cluster tasks through MPI?)
import dask
dask.set_options(get=dask.threaded.get)
df3 = dd.merge(df1, df2).compute()