I am confused about how to get the best out of Dask.
Problem
I have a dataframe that contains several time series (each with its own key), and I need to run a function `my_fun` on each of them. One way to solve this problem in plain pandas is `df = list(df.groupby("key"))` and then apply `my_fun` with multiprocessing, as sketched below. The performance, despite the huge RAM usage, is pretty good on my machine and terrible on Google Cloud.
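Roughly, the pandas + multiprocessing version looks like this (a minimal sketch with toy data and a toy `my_fun`; the real series and function are of course much bigger):

import pandas as pd
from multiprocessing import Pool

# Toy stand-ins: the real df holds many long time series and my_fun is much heavier.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

def my_fun(group):
    return group["value"].mean()

def apply_my_fun(item):
    key, group = item
    return key, my_fun(group)

if __name__ == "__main__":
    # Materialising every group in a list at once is what eats the RAM.
    groups = list(df.groupby("key"))
    with Pool(processes=4) as pool:
        results = dict(pool.map(apply_my_fun, groups))
    print(results)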
In Dask, my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
- Reading the data from S3: 14 files → 14 partitions.
- `df.groupby("key").apply(my_fun).to_frame().compute(get=get)`
Since I did not set an index, `df.known_divisions` is `False`.
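Put together, the workflow is roughly the sketch below (the S3 path is a placeholder, `my_fun` is only a stand-in for my real per-series computation, and the `meta` argument is my guess at its output type):

import dask.dataframe as dd
from dask.multiprocessing import get   # newer Dask versions use scheduler="processes" instead

def my_fun(group):
    # stand-in for the real per-series computation
    return group["value"].mean()

# Placeholder path: reading 14 files from S3 gives 14 partitions (needs s3fs installed)
df = dd.read_csv("s3://my-bucket/timeseries/*.csv")
print(df.npartitions)        # 14
print(df.known_divisions)    # False, since I never set an index

result = (
    df.groupby("key")
      .apply(my_fun, meta=("value", "f8"))
      .to_frame()
      .compute(get=get)      # multiprocessing scheduler
)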
Looking at the resulting task graph, I do not understand whether what I see is a bottleneck or not.
Questions:
- Is it better to have `df.npartitions` be a multiple of the number of CPUs, or does it not matter?
- From this, it seems better to set the index to the key. I suppose I can do something like
df ["key2"] = df ["key"] df = df.set_index ("key2")
but, again, I do not know if this is the best approach; a toy sketch follows below.
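On a toy frame, the set_index idea I have in mind looks like this (a sketch; the real data comes from S3 and my_fun is heavier than the lambda here):

import pandas as pd
import dask.dataframe as dd

# Copy "key" so it also survives as a column once its duplicate becomes the index.
pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
df = dd.from_pandas(pdf, npartitions=2)

df["key2"] = df["key"]
df = df.set_index("key2")    # triggers a shuffle; divisions on "key2" become known
print(df.known_divisions)    # True

result = df.groupby("key").apply(
    lambda g: g["value"].mean(),
    meta=("value", "f8"),
).compute()
print(result)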