I just started using dask, and I'm still confused about how to do simple pandas tasks with multiple threads or using a cluster.
Take pandas.merge() with dask data frames.
import dask.dataframe as dd
df1 = dd.read_csv("file1.csv")
df2 = dd.read_csv("file2.csv")
df3 = dd.merge(df1, df2)
Now let's say I run this on my laptop with 4 cores. How do I assign 4 threads to this task?
It seems the right way to do this is:
import dask
dask.set_options(get=dask.threaded.get)
df3 = dd.merge(df1, df2).compute()
Will that use as many threads as there are cores with shared memory on my laptop (i.e. 4)? How do I set the number of threads?
Say I'm on a machine with 100 cores. How can I submit this the same way I would submit jobs to a cluster using qsub? (Similar to running cluster tasks through MPI?)
import dask
dask.set_options(get=dask.threaded.get)
df3 = dd.merge(df1, df2).compute()