Strategy for efficiently partitioning Dask dataframes

The Dask documentation talks about repartitioning to reduce overhead here.

However, they seem to indicate that you need some knowledge of what your data frame will look like beforehand (i.e., that there will be 1/100th of the data expected).

Is there a good way to repartition sensibly without making assumptions? For the time being, I'm just repartitioning with npartitions = ncores * magic_number and setting force to True to expand partitions if necessary. This one-size-fits-all approach works, but it is definitely suboptimal, as my dataset varies in size.
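
For reference, a minimal sketch of that one-size-fits-all approach (the input path, ncores, and magic_number values below are placeholders, not values from the question):

import dask.dataframe as dd

# Illustrative sketch of the "magic number" approach described above.
# ncores and magic_number are placeholder values.
df = dd.read_csv("data/*.csv")  # hypothetical input

ncores = 8
magic_number = 4
df = df.repartition(npartitions=ncores * magic_number, force=True)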

The data is time series data but, unfortunately, not at regular intervals. In the past I have repartitioned by time frequency, but this would be suboptimal because of how irregular the data is (sometimes nothing happens for minutes, then thousands of records arrive in seconds).
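
For comparison, a sketch of the time-frequency repartitioning mentioned above; it assumes a sorted datetime column, and the column name and frequency are illustrative:

import dask.dataframe as dd

# Sketch of repartitioning by time frequency; requires a DatetimeIndex.
# The "timestamp" column and the "1h" frequency are assumed examples.
df = dd.read_csv("events/*.csv", parse_dates=["timestamp"])  # hypothetical input
df = df.set_index("timestamp", sorted=True)  # assumes data already sorted by time
df = df.repartition(freq="1h")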

+10
2 answers

After discussing with mrocklin, a decent partitioning strategy is to aim for partition sizes of about 100 MB, guided by df.memory_usage().sum().compute(). With datasets that fit in RAM, the extra work this may require can be mitigated by using df.persist() placed at the appropriate points.
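
A minimal sketch of that strategy, assuming the data fits in RAM (the input path and the 100 MB target are illustrative):

import dask.dataframe as dd

# Hypothetical example: aim for ~100 MB partitions based on measured memory usage.
df = dd.read_csv("data/*.csv")
df = df.persist()  # keep the data in memory so later steps don't re-read it

target = 100 * 1024 * 1024  # ~100 MB per partition
total_bytes = df.memory_usage().sum().compute()
df = df.repartition(npartitions=max(1, int(total_bytes // target)))
df = df.persist()  # persist the repartitioned result for reuse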

+5

Just to add to Samantha Hughes's answer:

memory_usage() by default ignores the memory consumption of object dtype columns. For the datasets I've been working with recently, this underestimates memory usage by about 10x.

Unless you are sure there are no object dtype columns, I would suggest specifying deep=True, i.e. repartition using:

df.repartition(npartitions=1 + df.memory_usage(deep=True).sum().compute() // n)

where n is the target partition size in bytes. Adding 1 ensures that the number of partitions is always at least 1 (// performs floor division).
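
Putting it together, a hedged end-to-end sketch (the input path and the 100 MB value of n are assumptions, not part of the answer):

import dask.dataframe as dd

# Illustrative: repartition to ~100 MB partitions, counting object-dtype columns
# via deep=True so their memory use is not ignored.
df = dd.read_csv("data/*.csv")  # hypothetical input

n = 100 * 1024 * 1024  # target partition size in bytes
total_bytes = df.memory_usage(deep=True).sum().compute()
df = df.repartition(npartitions=int(1 + total_bytes // n))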

+4

Source: https://habr.com/ru/post/1269038/

