Dask performance: workflow doubts

I am confused about how to get the best out of Dask.

Problem: I have a dataframe that contains several time series (each with its own key), and I need to run a function my_fun for each of them. One way to solve this with plain pandas is df = list(df.groupby("key")) and then applying my_fun with multiprocessing. The performance, despite the heavy use of RAM, is pretty good on my machine but terrible on Google Cloud.
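For reference, a minimal sketch of that pandas/multiprocessing baseline (the toy dataframe and the body of my_fun are placeholders for the real workload):

    import pandas as pd
    from multiprocessing import Pool

    def my_fun(group_df):
        # stand-in for the real per-series computation
        return group_df["value"].mean()

    if __name__ == "__main__":
        # toy stand-in for the real dataframe of stacked time series
        df = pd.DataFrame({"key": ["a", "a", "b", "b"],
                           "value": [1.0, 2.0, 3.0, 4.0]})
        # one materialized dataframe per key: this is the RAM-hungry part
        groups = [g for _, g in df.groupby("key")]
        with Pool() as pool:
            results = pool.map(my_fun, groups)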

In Dask, my current workflow is:

import dask.dataframe as dd
from dask.multiprocessing import get

  • Reading the data from S3: 14 files → 14 partitions
  • `df.groupby("key").apply(my_fun).to_frame().compute(get=get)`

Since I did not set an index, df.known_divisions is False.
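Put together, the workflow reads roughly like this (the S3 path and the body of my_fun are placeholders):

    import dask.dataframe as dd
    from dask.multiprocessing import get

    def my_fun(group_df):
        # stand-in for the real per-series computation
        return group_df["value"].mean()

    # 14 files on S3 -> 14 partitions (bucket and pattern are placeholders)
    df = dd.read_csv("s3://my-bucket/series-*.csv")

    print(df.known_divisions)   # False, since no index was set

    result = (df.groupby("key")
                .apply(my_fun)
                .to_frame()
                .compute(get=get))   # multiprocessing scheduler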

The resulting task graph is shown below [image: task graph], and I do not understand whether what I see is a bottleneck or not.
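Rather than judging from a screenshot, the graph can be rendered directly; a minimal sketch (this needs the graphviz package installed, and the filename is arbitrary):

    # build the computation lazily, then draw its task graph
    out = df.groupby("key").apply(my_fun).to_frame()
    out.visualize(filename="graph.png")   # writes the task graph to disk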

Questions:

  • Is it better to have df.npartitions as a multiple of ncpu, or does it not matter?
  • From this, it seems better to set the index to the key. I suppose I can do something like

    df["key2"] = df["key"]
    df = df.set_index("key2")

but, again, I do not know whether this is the best approach.
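A minimal sketch of that index-based variant (my_fun and the scheduler import are as above; whether this beats the unindexed version is exactly the open question):

    df["key2"] = df["key"]       # keep the original key column around
    df = df.set_index("key2")    # one-off shuffle; sorts the data by key

    print(df.known_divisions)    # now True

    # grouping on the sorted index means each key should sit in a
    # single partition, so the apply should not need a full shuffle
    result = (df.groupby("key2")
                .apply(my_fun)
                .to_frame()
                .compute(get=get))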

+4
1 answer

Setting the index to the key is worth doing. When Dask knows the divisions, it knows which partition holds which keys, so a groupby-apply on the index can run group by group within partitions instead of shuffling all the data between processes; without a known index, every groupby-apply implies a full shuffle.

On the number of partitions: it does not have to be an exact multiple of the number of cores, but an awkward count can waste time at the tail end. With 9 partitions on 8 cores, 8 tasks run in parallel first and then 7 cores sit idle while the ninth finishes; dask schedules tasks as cores become free, so having somewhat more partitions than cores smooths this out. Beyond that, the scheduler takes care of it.
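One way to avoid that tail effect is to repartition to a small multiple of the core count; a minimal sketch (the factor of 2 is an assumed rule of thumb, not a hard requirement):

    import multiprocessing

    ncpu = multiprocessing.cpu_count()
    df = df.repartition(npartitions=2 * ncpu)   # a few partitions per core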

+3

Source: https://habr.com/ru/post/1690239/

