dplyr: maximum value in a group, excluding the value in each row?

I have a data frame that looks like this:

    > df <- data_frame(g = c('A', 'A', 'B', 'B', 'B', 'C'), x = c(7, 3, 5, 9, 2, 4))
    > df
    Source: local data frame [6 x 2]

      g x
    1 A 7
    2 A 3
    3 B 5
    4 B 9
    5 B 2
    6 C 4

I know how to add a column with the maximum x value for each g group:

    > df %>% group_by(g) %>% mutate(x_max = max(x))
    Source: local data frame [6 x 3]
    Groups: g

      g x x_max
    1 A 7     7
    2 A 3     7
    3 B 5     9
    4 B 9     9
    5 B 2     9
    6 C 4     4

But I need to get the maximum x value for each g group, excluding the x value in each row.

In this example, the desired result will look like this:

    Source: local data frame [6 x 3]
    Groups: g

      g x x_max x_max_exclude
    1 A 7     7             3
    2 A 3     7             7
    3 B 5     9             9
    4 B 9     9             5
    5 B 2     9             9
    6 C 4     4            NA

I thought I could use row_number() to drop the current element and take the maximum of the remaining ones, but instead I get warning messages and the wrong -Inf output:

    > df %>% group_by(g) %>% mutate(x_max = max(x), r = row_number(), x_max_exclude = max(x[-r]))
    Source: local data frame [6 x 5]
    Groups: g

      g x x_max r x_max_exclude
    1 A 7     7 1          -Inf
    2 A 3     7 2          -Inf
    3 B 5     9 1          -Inf
    4 B 9     9 2          -Inf
    5 B 2     9 3          -Inf
    6 C 4     4 1          -Inf
    Warning messages:
    1: In max(c(4, 9, 2)[-1:3]) : no non-missing arguments to max; returning -Inf
    2: In max(c(4, 9, 2)[-1:3]) : no non-missing arguments to max; returning -Inf
    3: In max(c(4, 9, 2)[-1:3]) : no non-missing arguments to max; returning -Inf
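Looking at the warning text, my working guess is that inside mutate() the column r is the whole vector of row numbers for the group rather than the current row's index, so x[-r] drops every element, and max() of an empty vector returns -Inf. A minimal sketch outside dplyr, using the values shown in the warning:

    # Reproducing the failure outside dplyr, with the vector from the warning
    x <- c(4, 9, 2)
    r <- 1:3        # row_number() yields the full vector, not a per-row scalar
    x[-r]           # numeric(0): every element is dropped
    max(x[-r])      # -Inf, plus "no non-missing arguments to max" warning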

What is the most readable, concise, and efficient way to get this output in dplyr? Any further insight into why my row_number() attempt does not work is also much appreciated. Thanks for the help.

4 answers

You can try:

    df %>%
      group_by(g) %>%
      arrange(desc(x)) %>%
      mutate(max = ifelse(x == max(x), x[2], max(x)))

Which gives:

    #Source: local data frame [6 x 3]
    #Groups: g
    #
    #  g x max
    #1 A 7   3
    #2 A 3   7
    #3 B 9   5
    #4 B 5   9
    #5 B 2   9
    #6 C 4  NA
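The idea, as I read it: after arrange(desc(x)) each group is sorted, so x[2] is the group's second-largest value; every row then gets the group maximum, except the row holding the maximum, which gets that runner-up instead. A minimal sketch of the per-group logic:

    # Group B after arrange(desc(x)): 9 5 2
    xb <- c(9, 5, 2)
    ifelse(xb == max(xb), xb[2], max(xb))   # 5 9 9: the max row gets the runner-up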

Benchmark

I benchmarked the proposed solutions:

    df <- data.frame(g = sample(LETTERS, 10e5, replace = TRUE),
                     x = sample(1:10, 10e5, replace = TRUE))

    library(microbenchmark)

    mbm <- microbenchmark(
      steven = df %>%
        group_by(g) %>%
        arrange(desc(x)) %>%
        mutate(max = ifelse(x == max(x), x[2], max(x))),
      eric = df %>%
        group_by(g) %>%
        mutate(x_max = max(x),
               x_max2 = sort(x, decreasing = TRUE)[2],
               x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
        select(-x_max2),
      arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g],
      times = 50
    )

@Arun's data.table solution is the fastest:

    # Unit: milliseconds
    #   expr       min        lq      mean    median       uq      max neval cld
    # steven 158.58083 163.82669 197.28946 210.54179 212.1517 260.1448    50   b
    #   eric 223.37877 228.98313 262.01623 274.74702 277.1431 284.5170    50   c
    #   arun  44.48639  46.17961  54.65824  47.74142  48.9884 102.3830    50   a



An interesting problem. Here's one way using data.table:

    require(data.table)
    setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]

The idea is to order by column x and, on those ordered indices, group by g. Since the indices within each group are ordered, for the first .N-1 rows the maximum excluding the current value is the value at position .N, and for the .N-th row it is the value at position .N-1.

.N is a special variable that contains the number of observations in each group.
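A worked example of that expression for group B, written in plain R outside data.table (with .N spelled as n), just to illustrate the construction:

    xb <- sort(c(5, 9, 2))   # group B ordered by x: 2 5 9
    n  <- length(xb)         # what .N would be inside the group
    c(rep(xb[n], n - 1L), xb[n - 1L])
    # [1] 9 9 5  -> the .N-1 smaller rows get the max (9); the max row gets 5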

I'll leave it to you and/or the dplyr experts to translate this (or to answer in another way).


This is the best I've come up with so far. Not sure if there is a better way.

    df %>%
      group_by(g) %>%
      mutate(x_max = max(x),
             x_max2 = sort(x, decreasing = TRUE)[2],
             x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
      select(-x_max2)

Another way, using a helper function:

    df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x))
    Source: local data frame [6 x 3]
    Groups: g

      g x x_max_exclude
    1 A 7             3
    2 A 3             7
    3 B 5             9
    4 B 9             5
    5 B 2             9
    6 C 4            NA

We write a function called max_exclude that performs the operation you described.

    max_exclude <- function(v) {
      res <- c()
      for(i in seq_along(v)) {
        res[i] <- suppressWarnings(max(v[-i]))
      }
      res <- ifelse(!is.finite(res), NA, res)
      as.numeric(res)
    }
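A quick sanity check of max_exclude() on a single group's values (group B from the example):

    max_exclude(c(5, 9, 2))
    # [1] 9 5 9   (for 5: max(9, 2); for 9: max(5, 2); for 2: max(5, 9))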

It also works in base R, using ave():

    df$x_max_exclude <- with(df, ave(x, g, FUN=max_exclude))

    Source: local data frame [6 x 3]

      g x x_max_exclude
    1 A 7             3
    2 A 3             7
    3 B 5             9
    4 B 9             5
    5 B 2             9
    6 C 4            NA

Benchmark

Here's a lesson, kids: beware of loops!

    big.df <- data.frame(g=rep(LETTERS[1:4], each=1e3), x=sample(10, 4e3, replace=T))

    microbenchmark(
      plafort_dplyr = big.df %>% group_by(g) %>% mutate(x_max_exclude = max_exclude(x)),
      plafort_ave = big.df$x_max_exclude <- with(big.df, ave(x, g, FUN=max_exclude)),
      StevenB = (big.df %>%
                   group_by(g) %>%
                   mutate(max = ifelse(row_number(desc(x)) == 1,
                                       x[row_number(desc(x)) == 2],
                                       max(x)))),
      Eric = df %>%
        group_by(g) %>%
        mutate(x_max = max(x),
               x_max2 = sort(x, decreasing = TRUE)[2],
               x_max_exclude = ifelse(x == x_max, x_max2, x_max)) %>%
        select(-x_max2),
      Arun = setDT(df)[order(x), x_max_exclude := c(rep(x[.N], .N-1L), x[.N-1L]), by=g]
    )

    Unit: milliseconds
              expr       min        lq      mean    median        uq        max neval
     plafort_dplyr 75.219042 85.207442 89.247409 88.203225 90.627663 179.553166   100
       plafort_ave 75.907798 84.604180 87.136122 86.961251 89.431884 104.884294   100
           StevenB  4.436973  4.699226  5.207548  4.931484  5.364242  11.893306   100
              Eric  7.233057  8.034092  8.921904  8.414720  9.060488  15.946281   100
              Arun  1.789097  2.037235  2.410915  2.226988  2.423638   9.326272   100
