How to select the rows with maximum values in each group using dplyr?

I would like to select the row with the maximum value in each group with dplyr.

First, I generate some random data to illustrate my question:

    set.seed(1)
    df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
    df$value <- runif(nrow(df))

In plyr, I can use a custom function to select this row:

    library(plyr)
    ddply(df, .(A, B), function(x) x[which.max(x$value), ])

In dplyr, I use this code to get the maximum value per group, but it does not return the rows that hold the maximum (column C is lost in this case):

    library(dplyr)
    df %>%
      group_by(A, B) %>%
      summarise(max = max(value))

How could I achieve this? Thanks for any suggestion.

    sessionInfo()
    R version 3.1.0 (2014-04-10)
    Platform: x86_64-w64-mingw32/x64 (64-bit)

    locale:
    [1] LC_COLLATE=English_Australia.1252  LC_CTYPE=English_Australia.1252
    [3] LC_MONETARY=English_Australia.1252 LC_NUMERIC=C
    [5] LC_TIME=English_Australia.1252

    attached base packages:
    [1] stats     graphics  grDevices utils     datasets  methods   base

    other attached packages:
    [1] dplyr_0.2  plyr_1.8.1

    loaded via a namespace (and not attached):
    [1] assertthat_0.1.0.99 parallel_3.1.0      Rcpp_0.11.1
    [4] tools_3.1.0
+80
r greatest-n-per-group dplyr plyr
Jun 16 '14 at 6:00
6 answers

Try the following:

    result <- df %>%
      group_by(A, B) %>%
      filter(value == max(value)) %>%
      arrange(A, B, C)

Seems to work:

    identical(
      as.data.frame(result),
      ddply(df, .(A, B), function(x) x[which.max(x$value), ])
    )
    #[1] TRUE

As pointed out by @docendo in the comments, slice may be preferable here (as in @RoyalITS' answer below) if you strictly want only one row per group. This answer returns multiple rows when several rows share the same maximum value.
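To make the difference concrete, here is a minimal sketch on a small, made-up data frame (the name `ties` and its columns are illustrative assumptions, not part of the question's data) in which group "a" has two rows tied at the maximum:

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied at the maximum value 3.
ties <- data.frame(
  g     = c("a", "a", "a", "b", "b"),
  id    = 1:5,
  value = c(3, 3, 1, 2, 5)
)

# filter() keeps every row equal to the group maximum, so both tied rows survive.
all_max <- ties %>%
  group_by(g) %>%
  filter(value == max(value)) %>%
  ungroup()

# slice(which.max()) keeps exactly one row per group: the first maximum found.
first_max <- ties %>%
  group_by(g) %>%
  slice(which.max(value)) %>%
  ungroup()

nrow(all_max)    # 3 rows: ids 1, 2 and 5
nrow(first_max)  # 2 rows: ids 1 and 5
```

Which behaviour is right depends on whether tied rows are meaningful in your data.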

+93
Jun 16 '14 at 6:10

You can use top_n

 df %>% group_by(A, B) %>% top_n(n=1) 

This ranks by the last column (value) and returns the top n = 1 rows.

At the time of writing, you cannot change this default without causing an error (see https://github.com/hadley/dplyr/issues/426).
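In later dplyr releases (1.0.0 and up), top_n() is superseded by slice_max(), which takes the ordering column explicitly. A minimal sketch, assuming dplyr >= 1.0.0 is installed and reusing the data from the question (the name `top_rows` is just illustrative):

```r
library(dplyr)

# Same data as in the question.
set.seed(1)
df <- expand.grid(list(A = 1:5, B = 1:5, C = 1:5))
df$value <- runif(nrow(df))

# Assumption: dplyr >= 1.0.0, where slice_max() is available.
# with_ties = FALSE guarantees exactly one row per group even if maxima tie.
top_rows <- df %>%
  group_by(A, B) %>%
  slice_max(value, n = 1, with_ties = FALSE) %>%
  ungroup()

nrow(top_rows)  # one row per (A, B) combination
```

Leaving with_ties at its default of TRUE would instead keep all tied rows, much like the filter(value == max(value)) approach.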

+56
Jun 16 '14 at 6:14
 df %>% group_by(A,B) %>% slice(which.max(value)) 
+46
Feb 24 '16 at 16:40

This more verbose solution gives more control over what happens when the maximum value is tied (in this example, it picks one of the tied rows at random):

    library(dplyr)
    df %>%
      group_by(A, B) %>%
      mutate(the_rank = rank(-value, ties.method = "random")) %>%
      filter(the_rank == 1) %>%
      select(-the_rank)
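Other values of ties.method change how ties are resolved. The sketch below wraps the same pipeline in a hypothetical helper, `keep_rank_one()`, and runs it on a made-up data frame with a tie in group "a"; both names are assumptions for illustration only:

```r
library(dplyr)

# Hypothetical toy data: group "a" has two rows tied at the maximum value 3.
ties <- data.frame(
  g     = c("a", "a", "a", "b", "b"),
  id    = 1:5,
  value = c(3, 3, 1, 2, 5)
)

# Hypothetical helper: keep the rows ranked first within each group,
# using the given ties.method of base R's rank().
keep_rank_one <- function(data, ties_method) {
  data %>%
    group_by(g) %>%
    mutate(the_rank = rank(-value, ties.method = ties_method)) %>%
    filter(the_rank == 1) %>%
    select(-the_rank) %>%
    ungroup()
}

# "first" breaks ties by row order, so exactly one row per group is kept.
keep_rank_one(ties, "first")$id  # 1 5

# "min" gives all tied rows rank 1, so every tied maximum is kept.
keep_rank_one(ties, "min")$id    # 1 2 5
```

Unlike "random", both of these choices are deterministic, which can make results easier to reproduce.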
+9
Jul 18 '16 at 7:59

More generally, I think you may want the "top" rows after sorting within each group.

When a single value is being maximized, you are effectively sorting by one column only. However, it is often useful to sort hierarchically by multiple columns (for example, a date column and a time column).

    # Answering the question of getting the row with max "value".
    df %>%
      # Within each grouping of A and B values.
      group_by(A, B) %>%
      # Sort rows in descending order by "value" column.
      arrange(desc(value)) %>%
      # Pick the top 1 value.
      slice(1) %>%
      # Remember to ungroup in case you want to do further work without grouping.
      ungroup()

    # Answering an extension of the question:
    # getting the row with the max "value" of the lowest "C".
    df %>%
      # Within each grouping of A and B values.
      group_by(A, B) %>%
      # Sort rows in ascending order by C, and then within that
      # in descending order by "value" column.
      arrange(C, desc(value)) %>%
      # Pick the one top row based on the sort.
      slice(1) %>%
      # Remember to ungroup in case you want to do further work without grouping.
      ungroup()
0
Jan 16 '19 at 19:06

For me, it helped to count the number of values per group. Store the count table in a new object, then filter for the group maximum based on the first grouping variable. For example:

    count_table <- df %>%
      group_by(A, B) %>%
      count() %>%
      arrange(A, desc(n))

    count_table %>%
      group_by(A) %>%
      filter(n == max(n))

or

 count_table %>% group_by(A) %>% top_n(1, n) 
0
Feb 01 '19 at 14:39


