Categorize a continuous variable with dplyr

Question

I want to create a new variable with 3 arbitrary categories based on continuous data.

set.seed(123)
df <- data.frame(a = rnorm(100))

Using base R, I would do:

df$category[df$a < 0.5] <- "low"
df$category[df$a > 0.5 & df$a < 0.6] <- "middle"
df$category[df$a > 0.6] <- "high"

Is there a dplyr solution for this, perhaps using mutate()?

Also, is there a way to calculate the categories rather than choosing them by hand? That is, can R work out where the breaks between the categories should be?

EDIT

The answer is in this thread; however, that thread does not cover labeling, which confused me (and may confuse others), so I believe this question still serves a purpose.

r dplyr

Filipw Nov 02 '16 at 12:35

2 answers

aichao · Answer 1 · 2016-11-02T12:45:34+0000

To convert from numeric to categorical, use cut. In your particular case, you want:

df$category <- cut(df$a, breaks=c(-Inf, 0.5, 0.6, Inf), labels=c("low","middle","high"))

Or using dplyr:

library(dplyr)
res <- df %>% mutate(category=cut(a, breaks=c(-Inf, 0.5, 0.6, Inf), labels=c("low","middle","high")))
##               a category
##1   -0.560475647      low
##2   -0.230177489      low
##3    1.558708314     high
##4    0.070508391      low
##5    0.129287735      low
## ...
##35   0.821581082     high
##36   0.688640254     high
##37   0.553917654   middle
##38  -0.061911711      low
##39  -0.305962664      low
##40  -0.380471001      low
## ...
##96  -0.600259587      low
##97   2.187332993     high
##98   1.532610626     high
##99  -0.235700359      low
##100 -1.026420900      low
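One detail of the calls above that is easy to miss: by default cut() builds intervals that are closed on the right, so a value of exactly 0.5 is labeled "low" and exactly 0.6 is labeled "middle". The base R code in the question leaves such values unassigned. Here is a minimal sketch of the difference, using a few made-up boundary values (the vector x is an illustration, not part of the original data):

```r
# Hypothetical boundary values, chosen only to show which side of each break is closed
x <- c(0.4, 0.5, 0.55, 0.6, 0.7)
breaks <- c(-Inf, 0.5, 0.6, Inf)
labs <- c("low", "middle", "high")

# Default (right = TRUE): intervals are (-Inf, 0.5], (0.5, 0.6], (0.6, Inf]
right_closed <- cut(x, breaks = breaks, labels = labs)

# right = FALSE: intervals are [-Inf, 0.5), [0.5, 0.6), [0.6, Inf)
left_closed <- cut(x, breaks = breaks, labels = labs, right = FALSE)

data.frame(x, right_closed, left_closed)
```

The right argument can be passed the same way inside mutate() if the left-closed convention suits the data better.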

Robert · Answer 2 · 2016-11-02T13:01:58+0000

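To let R compute the breaks, cut at the quantiles:

library(dplyr)
# Tertile breaks: minimum, 1/3 and 2/3 quantiles, maximum
xs <- quantile(df$a, c(0, 1/3, 2/3, 1))
# Nudge the lowest break down so the minimum falls inside the first interval
xs[1] <- xs[1] - 0.00005
df1 <- df %>% mutate(category = cut(a, breaks = xs, labels = c("low", "middle", "high")))
boxplot(df1$a ~ df1$category, col = 3:5)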
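Subtracting a small constant from the lowest break is needed because cut() leaves the left end of the first interval open, so the minimum would otherwise become NA. A possible alternative, sketched here in base R on the data from the question, is to pass include.lowest = TRUE instead of adjusting the break:

```r
set.seed(123)
df <- data.frame(a = rnorm(100))   # same data as in the question

# Tertile breaks computed from the data
xs <- quantile(df$a, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE closes the first interval on the left,
# so min(df$a) is assigned to "low" rather than turned into NA
df$category <- cut(df$a, breaks = xs,
                   labels = c("low", "middle", "high"),
                   include.lowest = TRUE)

# Count the rows per category; NA would be listed if any value were missed
table(df$category, useNA = "ifany")
```

The same cut() call should work unchanged inside mutate(), and it avoids picking an offset that depends on the scale of the data.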
