Pandas: best way to sort, group, and sum

I am new to Pandas, so I am wondering whether there is a more Pandithic (coining it!) way to sort some data, group it, and then sum part of it. The problem is to find the 3 largest values in a series of values, and then sum only those.

census_cp is a DataFrame with district-level information for each state. My current solution:

cen_sort = census_cp.groupby('STNAME').head(3)
cen_sort = cen_sort.groupby('STNAME').sum().sort_values(by='CENSUS2010POP', ascending=False).head(n=3)
cen_sort = cen_sort.reset_index()
print(cen_sort['STNAME'].values.tolist())

I am particularly curious whether there is a better way to do this, and also why I cannot put the sum at the end of the previous line and so combine what seem to me to be clearly related steps (get the top 3 from each group and add them together).

1 answer

I think you can use head with sum inside groupby first, and then nlargest:

df = census_cp.groupby('STNAME') \
              .apply(lambda x: x.head(3).sum(numeric_only=True)) \
              .reset_index() \
              .nlargest(3, 'CENSUS2010POP')

Sample:

import pandas as pd

census_cp = pd.DataFrame({'STNAME':list('abscscbcdbcsscae'),
                   'CENSUS2010POP':[4,5,6,5,6,2,3,4,5,6,4,5,4,3,6,5]})

print (census_cp)
    CENSUS2010POP STNAME
0               4      a
1               5      b
2               6      s
3               5      c
4               6      s
5               2      c
6               3      b
7               4      c
8               5      d
9               6      b
10              4      c
11              5      s
12              4      s
13              3      c
14              6      a
15              5      e


df = census_cp.groupby('STNAME') \
              .apply(lambda x: x.head(3).sum(numeric_only=True)) \
              .reset_index() \
              .nlargest(3, 'CENSUS2010POP')
print (df)
  STNAME  CENSUS2010POP
5      s             17
1      b             14
2      c             11

If you need the 3 largest values per group, use nlargest inside the groupby and then nlargest again on the sums:

df1 = census_cp.groupby('STNAME')['CENSUS2010POP'] \
               .apply(lambda x: x.nlargest(3).sum()) \
               .nlargest(3) \
               .reset_index()
print (df1)
  STNAME  CENSUS2010POP
0      s             17
1      b             14
2      c             13
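
The two results differ for group c (11 versus 13) because head(3) takes the first three rows of each group in their existing order, while nlargest(3) takes the three largest values. A minimal sketch on the sample data above; the names first_three and top_three are only illustrative:

```python
import pandas as pd

census_cp = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                          'CENSUS2010POP': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

grouped = census_cp.groupby('STNAME')['CENSUS2010POP']

# head(3): first three rows of each group, in the order they appear
first_three = grouped.apply(lambda x: x.head(3).sum())

# nlargest(3): the three largest values of each group
top_three = grouped.apply(lambda x: x.nlargest(3).sum())

# Group c holds 5, 2, 4, 4, 3 in row order
print(first_three['c'], top_three['c'])  # 11 13
```

If the frame is already sorted by CENSUS2010POP in descending order, the two approaches give the same sums.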

Alternatively:

df1 = census_cp.groupby('STNAME')['CENSUS2010POP'].nlargest(3) \
               .groupby(level=0) \
               .sum() \
               .nlargest(3) \
               .reset_index()
print (df1)
  STNAME  CENSUS2010POP
0      s             17
1      b             14
2      c             13
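
On the second part of the question: as far as I can tell, sum cannot simply be appended after head because groupby(...).head(3) returns an ordinary DataFrame of the selected rows, not a grouped object. A sum chained directly onto it adds up all rows at once, so the rows have to be grouped again. A sketch on the sample data; sorting before head is my own addition so that head(3) picks the largest values:

```python
import pandas as pd

census_cp = pd.DataFrame({'STNAME': list('abscscbcdbcsscae'),
                          'CENSUS2010POP': [4, 5, 6, 5, 6, 2, 3, 4, 5, 6, 4, 5, 4, 3, 6, 5]})

# head on a groupby gives back a plain DataFrame; the grouping is gone
heads = census_cp.groupby('STNAME').head(3)
print(type(heads).__name__)  # DataFrame

# Summing it directly collapses every state into a single number
total = heads['CENSUS2010POP'].sum()

# Sort, take the top 3 rows per state, regroup, then sum, all in one chain
result = (census_cp.sort_values('CENSUS2010POP', ascending=False)
                   .groupby('STNAME').head(3)
                   .groupby('STNAME')['CENSUS2010POP'].sum()
                   .nlargest(3)
                   .reset_index())
print(result['STNAME'].tolist())  # ['s', 'b', 'c']
```

Wrapping the chain in parentheses avoids the trailing backslashes while keeping one method per line.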

Source: https://habr.com/ru/post/1667872/

