The Dask documentation talks about repartitioning to reduce overhead here.
However, it seems to assume that you know in advance what your dataframe will look like (e.g., that you expect to keep roughly 1/100 of the data).
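For reference, the docs' suggestion is roughly along these lines: after an operation that shrinks the data by a known factor, scale the partition count down by the same factor. The source path, the filter, and the 1/100 figure below are illustrative, not from my code:

```python
import dask.dataframe as dd

df = dd.read_csv("s3://bucket/path/to/*.csv")  # hypothetical source
df = df[df.name == "Alice"]                    # filter assumed to keep ~1/100 of the rows

# Shrink the partition count by the same factor the data shrank,
# which only works if you know that factor up front.
df = df.repartition(npartitions=df.npartitions // 100)
```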
Is there a good way to repartition sensibly without making such assumptions? At the moment I just repartition with npartitions = ncores * magic_number, setting force=True to expand the partitions if need be. This one-size-fits-all approach works, but it is definitely suboptimal because my dataset varies in size.
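Concretely, the current approach looks something like this; magic_number is a hand-tuned constant and the frame is just a stand-in for my data:

```python
import multiprocessing

import dask.dataframe as dd
import pandas as pd

# Stand-in frame; the real data is an irregular time series.
df = dd.from_pandas(pd.DataFrame({"value": range(1_000_000)}), npartitions=8)

ncores = multiprocessing.cpu_count()
magic_number = 4  # hand-tuned, not derived from the data

# One-size-fits-all: partition count tied to core count, with force=True
# so the existing divisions may be expanded if necessary.
df = df.repartition(npartitions=ncores * magic_number, force=True)
```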
The data I am working with is time series data, but unfortunately not at regular intervals. I have used repartitioning by time frequency in the past, but this would be suboptimal because of how irregular the data is (sometimes nothing happens for minutes, then thousands of events arrive in seconds).
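For completeness, this is what the frequency-based repartitioning I tried earlier looks like; the "1h" interval and the regularly spaced index below are arbitrary choices for illustration:

```python
import dask.dataframe as dd
import pandas as pd

# Illustrative datetime-indexed frame (the real data is far more irregular).
idx = pd.date_range("2024-01-01", periods=100_000, freq="s")
ts = dd.from_pandas(pd.DataFrame({"value": range(len(idx))}, index=idx), npartitions=8)

# Repartition into fixed wall-clock windows; with bursty data this produces
# some nearly empty partitions and some huge ones.
ts = ts.repartition(freq="1h")
```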