Averaging over duplicated rows in R

I have a data frame df whose rows are duplicated in the name column, but not in the value column:

 name value etc1 etc2
 A    9     1    X
 A    10    1    X
 A    11    1    X
 B    2     1    Y
 C    40    1    Y
 C    50    1    Y

I need to combine the duplicate names into one row each, averaging the value column. The expected result is:

 name value etc1 etc2
 A    10    1    X
 B    2     1    Y
 C    45    1    Y

I tried df[duplicated(df$name), ], but of course this only returns the repeated rows and does not average them. I would like to use aggregate(), but the problem is that FUN would be applied to all the other columns as well and, among other problems, it cannot compute a mean for character columns. Since all the other columns have identical content within each group of duplicates, I need them to be carried through the aggregation unchanged, just like the name column. Any clues?

+4
4 answers

Here is a data.table solution. It is general in the sense that it works even for a data.frame with 60 columns, because I group the data by every variable other than value (see how keys is created below).

 library(data.table)
 dat <- read.table(text='name value etc1 etc2
 A 9 1 X
 A 10 1 X
 A 11 1 X
 B 2 1 Y
 C 40 1 Y
 C 50 1 Y',header=TRUE)
 keys <- colnames(dat)[!grepl('value',colnames(dat))]
 X <- as.data.table(dat)
 X[,list(mm= mean(value)),keys]
    name etc1 etc2 mm
 1:    A    1    X 10
 2:    B    1    Y  2
 3:    C    1    Y 45

EDIT: extending to more than one value variable

If you have several numeric variables that you want to average, for example if your data looks like this:

   name value etc1 etc2     value1
 1    A     9    1    X  2.1763485
 2    A    10    1    X -0.7954326
 3    A    11    1    X -0.5839844
 4    B     2    1    Y -0.5188709
 5    C    40    1    Y -0.8300233
 6    C    50    1    Y -0.7787496

The above solution can be extended as follows:

 X[,lapply(.SD,mean),keys]
    name etc1 etc2 value     value1
 1:    A    1    X    10  0.2656438
 2:    B    1    Y     2 -0.5188709
 3:    C    1    Y    45 -0.8043865

This calculates the average of every variable that is not in the key list.
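If some non-key columns should not be averaged, one option is to name the columns explicitly with .SDcols. The following is only a minimal sketch: the value1 numbers are made up for illustration, and the variable names keys and mean_cols are assumptions rather than part of the answer above.

```r
library(data.table)

# Small example table; value1 holds made-up numbers, for illustration only
X <- data.table(
  name   = c("A", "A", "A", "B", "C", "C"),
  value  = c(9, 10, 11, 2, 40, 50),
  etc1   = 1L,
  etc2   = c("X", "X", "X", "Y", "Y", "Y"),
  value1 = c(0.5, 1.5, 2.5, 4, 1, 3)
)

keys      <- c("name", "etc1", "etc2")  # grouping columns
mean_cols <- c("value", "value1")       # columns to average

# .SDcols restricts .SD to the listed columns, so only those are averaged
res <- X[, lapply(.SD, mean), by = keys, .SDcols = mean_cols]
print(res)
```

Listing the columns this way avoids calling mean() on any column that is neither a key nor numeric.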

+8

You can use the aggregate() function as shown below:

 aggregate(df$value,by=list(name=df$name,etc1=df$etc1,etc2=df$etc2),FUN=mean)
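For comparison, here is a minimal sketch of the same aggregation written with the formula interface of aggregate(), which keeps the original column name value in the result. The data frame simply re-creates the example from the question; grouping by all three non-value columns is what lets etc1 and etc2 pass through unchanged.

```r
# Re-create the example data from the question
df <- read.table(text = "name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y", header = TRUE)

# Left of ~ : the column to average; right of ~ : the grouping columns
res <- aggregate(value ~ name + etc1 + etc2, data = df, FUN = mean)
print(res)
```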
+7

The code (written by Metrics) almost works, except for one place, .(name). I changed it a bit:

 library(plyr)  # provides ddply() and summarize()
 sample <- structure(list(
     name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
     value = c(9L, 10L, 11L, 2L, 40L, 50L),
     etc1 = c(1L, 1L, 1L, 1L, 1L, 1L),
     etc2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")),
   .Names = c("name", "value", "etc1", "etc2"),
   class = "data.frame", row.names = c(NA, -6L))
 # Average value per name; keep the first etc1 and etc2 of each group
 sample.m <- ddply(sample, 'name', summarize, value = mean(value), etc1 = head(etc1, 1), etc2 = head(etc2, 1))
 sample.m
   name value etc1 etc2
 1    A    10    1    X
 2    B     2    1    Y
 3    C    45    1    Y
+2

Assuming your data frame is df:

 install.packages("plyr")
 library(plyr)
 df <- structure(list(
     name = structure(c(1L, 1L, 1L, 2L, 3L, 3L), .Label = c("A", "B", "C"), class = "factor"),
     value = c(9L, 10L, 11L, 2L, 40L, 50L),
     etc1 = c(1L, 1L, 1L, 1L, 1L, 1L),
     etc2 = structure(c(1L, 1L, 1L, 2L, 2L, 2L), .Label = c("X", "Y"), class = "factor")),
   .Names = c("name", "value", "etc1", "etc2"),
   class = "data.frame", row.names = c(NA, -6L))
 df.m <- ddply(df, .(name), summarize, value = mean(value), etc1 = head(etc1, 1), etc2 = head(etc2, 1))
 df.m
   name value etc1 etc2
 1    A    10    1    X
 2    B     2    1    Y
 3    C    45    1    Y
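Taking head(etc1, 1) within each group relies on etc1 and etc2 being constant for a given name, as the question states. A quick base R check of that assumption might look like the sketch below; the helper name is_constant_within is hypothetical and not part of plyr or base R.

```r
# Re-create the example data from the question
df <- read.table(text = "name value etc1 etc2
A 9 1 X
A 10 1 X
A 11 1 X
B 2 1 Y
C 40 1 Y
C 50 1 Y", header = TRUE)

# Hypothetical helper: TRUE if `col` takes a single value within every group of `by`
is_constant_within <- function(data, col, by) {
  all(tapply(data[[col]], data[[by]], function(x) length(unique(x)) == 1))
}

# Check every column except the grouping column and the one being averaged
other_cols <- setdiff(names(df), c("name", "value"))
check <- sapply(other_cols, function(col) is_constant_within(df, col, "name"))
print(check)
```

If any entry of check is FALSE, keeping only the first value per group would silently discard information for that column.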
+1

Source: https://habr.com/ru/post/1488888/

