Average between repeated lines in R

Question


I have a data frame df in which several rows share the same value in the name column but differ in the value column:

 name value etc1 etc2
 A        9    1    X
 A       10    1    X
 A       11    1    X
 B        2    1    Y
 C       40    1    Y
 C       50    1    Y

I need to collapse the duplicated names into one row each, taking the average of the value column. The expected result is:

 name value etc1 etc2
 A       10    1    X
 B        2    1    Y
 C       45    1    Y

I tried df[duplicated(df$name), ], but of course this is not what I want: it only returns the second and later occurrences of each name, not their average. I would like to use aggregate(), but FUN would then be applied to all the other columns as well, and, among other problems, it cannot compute a mean of character columns. Since all the other columns hold identical values within each group of duplicates, I need them to be collapsed into a single row, just like the name column. Any clues?

+4

r duplicates aggregate mean

biohazard Jun 29 '13 at 18:51


4 answers

You can use the aggregate() function as shown below:

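 # Group by every column that identifies a row and average value within each group.
 # The averaged column is named x in the result.
 aggregate(df$value, by = list(name = df$name, etc1 = df$etc1, etc2 = df$etc2), FUN = mean)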
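As a variation on the call above, aggregate() also has a formula interface, which keeps the original column name value instead of producing x. This is only a minimal sketch: the data frame is rebuilt from the question so the snippet is self-contained, and it assumes the columns contain no missing values, because the formula method drops rows with NA by default.

```r
# Sample data copied from the question
df <- data.frame(
  name  = c("A", "A", "A", "B", "C", "C"),
  value = c(9, 10, 11, 2, 40, 50),
  etc1  = c(1, 1, 1, 1, 1, 1),
  etc2  = c("X", "X", "X", "Y", "Y", "Y")
)

# Left of ~ : the column to average; right of ~ : the grouping columns
res <- aggregate(value ~ name + etc1 + etc2, data = df, FUN = mean)
print(res)
```

The result has one row per distinct combination of name, etc1 and etc2, with the grouping columns first and the averaged value column last.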

+7

Homa ghiasi Feb 17 '15 at 14:08


The code written by Metrics almost works, except for one place (.(name)). I changed it slightly:

 library(plyr)

 sample <- structure(list(
   name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
   value = c(9L, 10L, 11L, 2L, 40L, 50L),
   etc1 = c(1L, 1L, 1L, 1L, 1L, 1L),
   etc2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")),
   .Names = c("name", "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, -6L))

 # Average value per name; etc1 and etc2 are constant within a name, so keep the first entry
 sample.m <- ddply(sample, 'name', summarize, value = mean(value), etc1 = head(etc1, 1), etc2 = head(etc2, 1))
 sample.m
   name value etc1 etc2
 1    A    10    1    X
 2    B     2    1    Y
 3    C    45    1    Y

+2

S das Jun 29 '13 at 19:34


Assuming your data frame is named df:

 install.packages("plyr")
 library(plyr)

 df <- structure(list(
   name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
   value = c(9L, 10L, 11L, 2L, 40L, 50L),
   etc1 = c(1L, 1L, 1L, 1L, 1L, 1L),
   etc2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")),
   .Names = c("name", "value", "etc1", "etc2"), class = "data.frame", row.names = c(NA, -6L))

 df.m <- ddply(df, .(name), summarize, value = mean(value), etc1 = head(etc1, 1), etc2 = head(etc2, 1))
 df.m
   name value etc1 etc2
 1    A    10    1    X
 2    B     2    1    Y
 3    C    45    1    Y

+1

Metrics Jun 29 '13 at 18:58


agstudy · Accepted Answer · 2013-06-29T20:11:05+0000

Here is a data.table solution. It is general in the sense that it will work even for a data.frame with 60 columns, because I group the data by all variables other than value (see how I create keys below).

 library(data.table)

 dat <- read.table(text='name value etc1 etc2
 A 9 1 X
 A 10 1 X
 A 11 1 X
 B 2 1 Y
 C 40 1 Y
 C 50 1 Y', header=TRUE)

 # Every column whose name does not contain 'value' becomes a grouping key
 keys <- colnames(dat)[!grepl('value', colnames(dat))]
 X <- as.data.table(dat)
 X[, list(mm = mean(value)), keys]
    name etc1 etc2 mm
 1:    A    1    X 10
 2:    B    1    Y  2
 3:    C    1    Y 45

EDIT: extending to more than one value variable

If you have several numeric variables that you want to average, for example if your data looks like this:

   name value etc1 etc2     value1
 1    A     9    1    X  2.1763485
 2    A    10    1    X -0.7954326
 3    A    11    1    X -0.5839844
 4    B     2    1    Y -0.5188709
 5    C    40    1    Y -0.8300233
 6    C    50    1    Y -0.7787496

The above solution can be extended as follows:

 X[, lapply(.SD, mean), keys]
    name etc1 etc2 value     value1
 1:    A    1    X    10  0.2656438
 2:    B    1    Y     2 -0.5188709
 3:    C    1    Y    45 -0.8043865

This will calculate the average of every variable that is not in the key list.
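Because every non-key column is averaged, a stray non-numeric column outside the key list would break the call. One possible way to guard against that is to name the columns to average explicitly with .SDcols and derive the keys from them. This is a sketch only: the value1 numbers below are made up for illustration and are not the ones shown above, and vals and res are names introduced here.

```r
library(data.table)

# Sample data from the question; value1 holds invented illustrative numbers
X <- data.table(
  name   = c("A", "A", "A", "B", "C", "C"),
  value  = c(9, 10, 11, 2, 40, 50),
  etc1   = 1L,
  etc2   = c("X", "X", "X", "Y", "Y", "Y"),
  value1 = c(1, 2, 3, 4, 5, 7)
)

vals <- c("value", "value1")     # columns to average (chosen explicitly)
keys <- setdiff(names(X), vals)  # everything else is a grouping key

# .SD is restricted to the columns listed in .SDcols
res <- X[, lapply(.SD, mean), by = keys, .SDcols = vals]
print(res)
```

Using setdiff() on exact names also avoids accidentally excluding a key column whose name merely contains the string value, which a grepl() pattern match could do.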
