OK, so the problem here is that every task contains the NumPy array x, which is large. For each of the 100 tasks we submit, we would need to serialize x, send it to the scheduler, send it to the worker, and so on.
Instead, we will send the array to the cluster once:
[future] = c.scatter([x])
Now future is a token pointing to the x array that lives on the cluster. We can submit tasks that refer to this remote future instead of shipping the NumPy array from our local client every time.
# futures = [c.submit(f, x, param) for param in params]
futures = [c.submit(f, future, param) for param in params]
This is much faster and lets Dask manage data movement much more efficiently.
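Putting the pieces together, here is a minimal runnable sketch of the scatter-then-submit pattern. The function f and the parameter values are hypothetical stand-ins; an in-process client is used so the example runs on a single machine.

```python
import numpy as np
from dask.distributed import Client

def f(x, param):
    # Hypothetical task: any function of the shared array and one parameter.
    return float(x.sum()) * param

client = Client(processes=False)  # lightweight in-process cluster for demonstration

x = np.ones(1_000_000)            # the large array
[x_future] = client.scatter([x])  # ship the array to the cluster once

params = [1, 2, 3]
# Each task references the remote future, not the local array.
futures = [client.submit(f, x_future, p) for p in params]
results = client.gather(futures)
print(results)  # [1000000.0, 2000000.0, 3000000.0]

client.close()
```

Because the tasks carry only a small token rather than the serialized array, the per-task overhead stays constant no matter how large x is.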
Scatter data to all workers
If you expect to need the x array on all workers, you may want to broadcast the array up front:
[future] = c.scatter([x], broadcast=True)
Use Dask Delayed
Futures work great with dask.delayed. There is no performance benefit here, but some people prefer this interface:
# futures = [c.submit(f, future, param) for param in params]
from dask import delayed
lazy_values = [delayed(f)(future, param) for param in params]
futures = c.compute(lazy_values)
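For completeness, here is the delayed variant as a self-contained sketch, again with a hypothetical f and in-process client. Scattered futures can be passed into delayed calls; client.compute turns the lazy values back into futures, which gather then resolves.

```python
import numpy as np
from dask import delayed
from dask.distributed import Client

def f(x, param):
    # Hypothetical task combining the shared array with one parameter.
    return float(x.sum()) + param

client = Client(processes=False)  # in-process cluster for demonstration

x = np.arange(4)                  # sums to 6
[x_future] = client.scatter([x])

params = [10, 20]
lazy_values = [delayed(f)(x_future, p) for p in params]
futures = client.compute(lazy_values)
results = client.gather(futures)
print(results)  # [16.0, 26.0]

client.close()
```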