How to apply a function to a subset of columns in R?

I use by() to apply a function to a range of columns of a data frame, grouped by a factor. Everything works fine if I use mean() as the function, but if I use median(), I get the error "Error in median.default(x, na.rm = T) : need numeric data", even though there are no NAs in the data frame.

A call that works with mean():

    > by(iris[,1:3], iris$Species, function(x) mean(x,na.rm=T))
    iris$Species: setosa
    Sepal.Length  Sepal.Width Petal.Length
           5.006        3.428        1.462
    ------------------------------------------------------------
    iris$Species: versicolor
    Sepal.Length  Sepal.Width Petal.Length
           5.936        2.770        4.260
    ------------------------------------------------------------
    iris$Species: virginica
    Sepal.Length  Sepal.Width Petal.Length
           6.588        2.974        5.552
    Warning messages:
    1: mean(<data.frame>) is deprecated. Use colMeans() or sapply(*, mean) instead.
    2: mean(<data.frame>) is deprecated. Use colMeans() or sapply(*, mean) instead.
    3: mean(<data.frame>) is deprecated. Use colMeans() or sapply(*, mean) instead.

But if I use median() (note the na.rm=T option):

    > by(iris[,1:3], iris$Species, function(x) median(x,na.rm=T))
    Error in median.default(x, na.rm = T) : need numeric data

However, if I select a single column instead of the range [,1:3], it works:

    > by(iris[,1], iris$Species, function(x) median(x,na.rm=T))
    iris$Species: setosa
    [1] 5
    ------------------------------------------------------------
    iris$Species: versicolor
    [1] 5.9
    ------------------------------------------------------------
    iris$Species: virginica
    [1] 6.5

How can I achieve this behavior when choosing a range of columns?

+6
2 answers

by() follows a split-apply strategy. The objects it passes to your function are data frames (one per group), so you get a warning because mean.data.frame is deprecated and an error because there is no median.data.frame method. This may work better with aggregate():

    > aggregate(iris[,1:3], iris["Species"], function(x) mean(x,na.rm=T))
         Species Sepal.Length Sepal.Width Petal.Length
    1     setosa        5.006       3.428        1.462
    2 versicolor        5.936       2.770        4.260
    3  virginica        6.588       2.974        5.552
    > aggregate(iris[,1:3], iris["Species"], function(x) median(x,na.rm=T))
         Species Sepal.Length Sepal.Width Petal.Length
    1     setosa          5.0         3.4         1.50
    2 versicolor          5.9         2.8         4.35
    3  virginica          6.5         3.0         5.55

aggregate works on column vectors individually and then tabulates the results.
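
If you would rather keep by(), one possible workaround (a sketch, not part of the original answer) is to apply median() column by column inside the function, as the deprecation warning in the question suggests with sapply(*, mean). The names d, res and med are illustrative only.

```r
# Keep by(), but call median() on each column of the per-group data frame
# instead of on the data frame as a whole.
res <- by(iris[, 1:3], iris$Species,
          function(d) sapply(d, median, na.rm = TRUE))
res

# Optionally stack the per-group named vectors into one matrix
# (one row per species, one column per selected variable).
med <- do.call(rbind, res)
med
```

Because each group now yields a named numeric vector, the result has the same per-species layout as the mean() output in the question, without relying on a data frame method for median().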

+4

The answer above addresses the original question. If, however, the range happens to be all columns except those given as grouping variables in the formula, the dot notation works and is a great option:

    > aggregate(. ~ Species, data = iris, mean)
         Species Sepal.Length Sepal.Width Petal.Length Petal.Width
    1     setosa        5.006       3.428        1.462       0.246
    2 versicolor        5.936       2.770        4.260       1.326
    3  virginica        6.588       2.974        5.552       2.026
    > aggregate(. ~ Species, data = iris, median)
         Species Sepal.Length Sepal.Width Petal.Length Petal.Width
    1     setosa          5.0         3.4         1.50         0.2
    2 versicolor          5.9         2.8         4.35         1.3
    3  virginica          6.5         3.0         5.55         2.0
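
When only some columns are needed, the formula interface can also name them explicitly with cbind() on the left-hand side. This is a minimal sketch, assuming the same three columns as in the question; med_sub is an illustrative name.

```r
# Formula interface restricted to a chosen subset of columns:
# cbind() on the left lists the variables to summarise,
# the right-hand side names the grouping variable.
med_sub <- aggregate(cbind(Sepal.Length, Sepal.Width, Petal.Length) ~ Species,
                     data = iris, FUN = median)
med_sub
```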
+1

Source: https://habr.com/ru/post/909759/

