How to use pandas.cut() (or equivalent) effectively in Dask?

Is there an equivalent of pandas.cut() in Dask?

I am trying to bin and group a large dataset in Python. It is a list of measured electrons with the properties (positionX, positionY, energy, time). I need to group it by positionX and positionY, and bin it into energy classes.

So far I have been able to do this with pandas, but I would like to run it in parallel, so I'm trying to use Dask.

The groupby method works very well, but unfortunately I run into difficulties when trying to bin the data by energy. I found a solution using pandas.cut(), but it requires calling compute() on the raw dataset (turning it into non-parallel code). Is there an equivalent of pandas.cut() in Dask, or is there another (elegant) way to achieve the same functionality?

import dask.dataframe
import pandas

# Create a dask dataframe from the array
dd = dask.dataframe.from_array(mainArray, chunksize=100000,
                               columns=('posX', 'posY', 'time', 'energy'))

# Set the bins to bin along energy
bins = range(0, 10000, 500)

# Create the cut in energy (using non-parallel pandas code...)
energyBinner = pandas.cut(dd['energy'], bins)

# Group the data according to posX, posY and energy
grouped = dd.compute().groupby([energyBinner, 'posX', 'posY'])

# Apply the count() method to the data:
numberOfEvents = grouped['time'].count()

Thanks a lot!

1 answer

You can do dd['energy'].map_partitions(pd.cut, bins). This applies pandas.cut independently to each partition, so the result stays a lazy, parallel dask Series that you can group by without calling compute() on the raw data.


Source: https://habr.com/ru/post/1015252/
