I am confused about how to get the best out of Dask.
Problem
I have a dataframe that contains several time series (each with its own key), and I need to run a function `my_fun` on each of them. One way to solve this problem in plain pandas is `df = list(df.groupby("key"))` and then apply `my_fun` with multiprocessing, as sketched below. The performance, despite the huge RAM usage, is pretty good on my machine and terrible on Google Cloud.
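Roughly, the pandas + multiprocessing version looks like this (a minimal sketch with toy data and a toy `my_fun`; the real series and function are of course much bigger):

import pandas as pd
from multiprocessing import Pool

# Toy stand-ins: the real df holds many long time series and my_fun is much heavier.
df = pd.DataFrame({"key": ["a", "a", "b", "b"], "value": [1.0, 2.0, 3.0, 4.0]})

def my_fun(group):
    return group["value"].mean()

def apply_my_fun(item):
    key, group = item
    return key, my_fun(group)

if __name__ == "__main__":
    # Materialising every group in a list at once is what eats the RAM.
    groups = list(df.groupby("key"))
    with Pool(processes=4) as pool:
        results = dict(pool.map(apply_my_fun, groups))
    print(results)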
In Dask, my current workflow is:
import dask.dataframe as dd
from dask.multiprocessing import get
- Reading the data from S3: 14 files → 14 partitions.
- `df.groupby("key").apply(my_fun).to_frame().compute(get=get)`
Since I did not set an index, `df.known_divisions` is `False`.
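Put together, the workflow is roughly the sketch below (the S3 path is a placeholder, `my_fun` is only a stand-in for my real per-series computation, and the `meta` argument is my guess at its output type):

import dask.dataframe as dd
from dask.multiprocessing import get   # newer Dask versions use scheduler="processes" instead

def my_fun(group):
    # stand-in for the real per-series computation
    return group["value"].mean()

# Placeholder path: reading 14 files from S3 gives 14 partitions (needs s3fs installed)
df = dd.read_csv("s3://my-bucket/timeseries/*.csv")
print(df.npartitions)        # 14
print(df.known_divisions)    # False, since I never set an index

result = (
    df.groupby("key")
      .apply(my_fun, meta=("value", "f8"))
      .to_frame()
      .compute(get=get)      # multiprocessing scheduler
)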
Looking at the resulting task graph, I do not understand whether what I see is a bottleneck or not.
Questions:
- Is it better to have `df.npartitions` be a multiple of the number of CPUs, or does it not matter?
- From this, it seems better to set the index to the key. I suppose I can do something like
df ["key2"] = df ["key"] df = df.set_index ("key2")
but, again, I do not know if this is the best approach; a toy sketch follows below.
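On a toy frame, the set_index idea I have in mind looks like this (a sketch; the real data comes from S3 and my_fun is heavier than the lambda here):

import pandas as pd
import dask.dataframe as dd

# Copy "key" so it also survives as a column once its duplicate becomes the index.
pdf = pd.DataFrame({"key": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
df = dd.from_pandas(pdf, npartitions=2)

df["key2"] = df["key"]
df = df.set_index("key2")    # triggers a shuffle; divisions on "key2" become known
print(df.known_divisions)    # True

result = df.groupby("key").apply(
    lambda g: g["value"].mean(),
    meta=("value", "f8"),
).compute()
print(result)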