Is there a way to get the largest items for each group in dask?

Question

I have the following dataset:

location  category    percent
A         5           100.0
B         3           100.0
C         2            50.0
          4            13.0
D         2            75.0
          3            59.0
          4            13.0
          5             4.0

I'm trying to get the largest elements of a column in a dataframe, grouped by location. That is, if I want the top 2 highest percentages for each group, the result should be as follows:

location  category    percent
A         5           100.0
B         3           100.0
C         2            50.0
          4            13.0
D         2            75.0
          3            59.0

It seems that in pandas this is relatively straightforward using pandas.core.groupby.SeriesGroupBy.nlargest, but dask has no nlargest function for groupby. I played with apply, but could not get it to work correctly.

df.groupby(['location']).apply(lambda x: x['percent'].nlargest(2)).compute()

But I just get the error message ValueError: Wrong number of items passed 0, placement implies 8

pandas grouping dask

whisperstream Nov 10 '17 at 17:06

1 answer

Andy Hayden · Accepted Answer · 2017-11-10T17:24:44+0000

The apply should work, but your syntax is a little off:

In [11]: df
Out[11]:
Dask DataFrame Structure:
              Unnamed: 0 location category  percent
npartitions=1
                   int64   object    int64  float64
                     ...      ...      ...      ...
Dask Name: from-delayed, 3 tasks

In [12]: df.groupby("location")["percent"].apply(lambda x: x.nlargest(2), meta=('x', 'f8')).compute()
Out[12]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: x, dtype: float64

In pandas, .nlargest and .rank are available as groupby methods, which let you do this without the apply:

In [21]: df1
Out[21]:
  location  category  percent
0        A         5    100.0
1        B         3    100.0
2        C         2     50.0
3        C         4     13.0
4        D         2     75.0
5        D         3     59.0
6        D         4     13.0
7        D         5      4.0

In [22]: df1.groupby("location")["percent"].nlargest(2)
Out[22]:
location
A         0    100.0
B         1    100.0
C         2     50.0
          3     13.0
D         4     75.0
          5     59.0
Name: percent, dtype: float64
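
The .rank route mentioned above is not shown in the session, so here is a minimal pandas sketch of it. Ranking within each group and then filtering keeps whole rows, including the category column that appears in the question's desired output. The names ranks and top2_rows, and the choice of method="first" (ties broken by row order), are assumptions for illustration.

```python
import pandas as pd

# Same sample data as df1 above.
df1 = pd.DataFrame({
    "location": ["A", "B", "C", "C", "D", "D", "D", "D"],
    "category": [5, 3, 2, 4, 2, 3, 4, 5],
    "percent": [100.0, 100.0, 50.0, 13.0, 75.0, 59.0, 13.0, 4.0],
})

# Rank percent within each location, highest first.
# method="first" breaks ties by order of appearance (an assumption here).
ranks = df1.groupby("location")["percent"].rank(method="first", ascending=False)

# Keep the rows ranked 1 or 2 within their group.
top2_rows = df1[ranks <= 2]
print(top2_rows)
```

Unlike groupby(...)["percent"].nlargest(2), which returns only the percent values, this keeps every column of the selected rows.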

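These are not available as groupby methods in dask, though, as its documentation notes:

Dask.dataframe covers a small but well-used portion of the pandas API.
This limitation is for two reasons:
The pandas API is huge.
Some operations are genuinely hard to do in parallel (e.g. sort).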
