Case statement equivalent in R

I have a variable in a data frame where one of the fields typically has 7-8 values. I want to collapse them into 3 or 4 new categories in a new variable within the data frame. What is the best approach?

I would use a CASE statement if I were in a SQL-like tool, but I am not sure how to approach this in R.

Any help you can provide would be greatly appreciated!

+48
r
Jan 07 2018-11-11T00:
13 answers

Take a look at the cases function from the memisc package. It implements case functionality with two different ways to use it. From the examples in the package:

    z1=cases(
        "Condition 1"=x<0,
        "Condition 2"=y<0,# only applies if x >= 0
        "Condition 3"=TRUE
        )

where x and y are two vectors.
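For readers without memisc installed, the same first-match-wins logic can be sketched in base R with nested ifelse calls. This is only an illustration of the semantics, not how cases is implemented, and it returns a character vector rather than a factor. The vectors x and y below are made-up sample data.

```r
# Made-up sample data (not from the original answer)
x <- c(-1, 2, 3)
y <- c(5, -4, 6)

# Conditions are tested in order; the first TRUE one decides the label
z1 <- ifelse(x < 0, "Condition 1",
      ifelse(y < 0, "Condition 2",    # only reached if x >= 0
                    "Condition 3"))   # fallback, like "Condition 3" = TRUE
z1
```

Nesting gets hard to read beyond three or four branches, which is where a dedicated case function becomes more attractive.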

+23
Jan 07 '11 at 18:15

If you have a factor, you can change its levels using the standard method:

    df <- data.frame(name = c('cow','pig','eagle','pigeon'),
                     stringsAsFactors = FALSE)
    df$type <- factor(df$name) # First step: copy vector and make it factor
    # Change levels:
    levels(df$type) <- list(
        animal = c("cow", "pig"),
        bird = c("eagle", "pigeon")
    )
    df
    #     name   type
    # 1    cow animal
    # 2    pig animal
    # 3  eagle   bird
    # 4 pigeon   bird

You can write a simple function as a wrapper:

    changelevels <- function(f, ...) {
        f <- as.factor(f)
        levels(f) <- list(...)
        f
    }

    df <- data.frame(name = c('cow','pig','eagle','pigeon'),
                     stringsAsFactors = TRUE)

    df$type <- changelevels(df$name, animal=c("cow", "pig"), bird=c("eagle", "pigeon"))

+17
Sep 12 '11 at 15:57

Here is an approach using switch:

    df <- data.frame(name = c('cow','pig','eagle','pigeon'),
                     stringsAsFactors = FALSE)
    df$type <- sapply(df$name, switch,
                      cow = 'animal',
                      pig = 'animal',
                      eagle = 'bird',
                      pigeon = 'bird')

    > df
        name   type
    1    cow animal
    2    pig animal
    3  eagle   bird
    4 pigeon   bird

The one downside is that you have to keep writing the category name (animal, etc.) for each item. It is syntactically more convenient to define the categories as shown below (see the very similar question "How to add a column to a data frame in R"):

 myMap <- list(animal = c('cow', 'pig'), bird = c('eagle', 'pigeon')) 

and we want to somehow "invert" this mapping. I wrote my own invMap function:

    invMap <- function(map) {
      items <- as.character( unlist(map) )
      nams <- unlist(Map(rep, names(map), sapply(map, length)))
      names(nams) <- items
      nams
    }

and then invert the above mapping as follows:

    > invMap(myMap)
         cow      pig    eagle   pigeon
    "animal" "animal"   "bird"   "bird"

And then it's easy to use this to add a type column to the data frame:

    df <- transform(df, type = invMap(myMap)[name])
    > df
        name   type
    1    cow animal
    2    pig animal
    3  eagle   bird
    4 pigeon   bird

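As a possible shortcut, the same inverted lookup can be sketched with the base R helpers setNames and lengths. This is an alternative sketch, not the answer's own invMap; the name lookup is just an illustrative choice, and the data repeats the myMap and df used in this answer so the block runs on its own.

```r
myMap <- list(animal = c('cow', 'pig'), bird = c('eagle', 'pigeon'))
df <- data.frame(name = c('cow', 'pig', 'eagle', 'pigeon'),
                 stringsAsFactors = FALSE)

# Repeat each category name once per item, then name the result by item
lookup <- setNames(rep(names(myMap), lengths(myMap)), unlist(myMap))

# Index the lookup by item name; unname() drops the item names again
df$type <- unname(lookup[df$name])
df
```

Any name missing from the map would come back as NA, which plays the role of a CASE without an ELSE branch.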
+13
Jan 07 2018-11-11T00:

Imho, the simplest and most universal code:

    dft=data.frame(x = sample(letters[1:8], 20, replace=TRUE))
    dft=within(dft,{
        y=NA
        y[x %in% c('a','b','c')]='abc'
        y[x %in% c('d','e','f')]='def'
        y[x %in% 'g']='g'
        y[x %in% 'h']='h'
    })

+11
Jan 07 2018-11-11T00:

I don't see any suggestions for plain "switch". Sample code (run it):

    x <- "three"
    y <- 0
    switch(x,
           one = {y <- 5},
           two = {y <- 12},
           three = {y <- 432})
    y

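Since switch handles a single value at a time, recoding a whole column needs a loop or an apply-style call. The sketch below is one way to do that under stated assumptions: word_to_number is a hypothetical helper name, the input "seven" is made up to show the fallback, and the unnamed last argument of switch serves as the default branch.

```r
# Hypothetical helper: maps one word to a number, NA if the word is unknown
word_to_number <- function(word) {
  switch(word,
         one = 5,
         two = 12,
         three = 432,
         NA_real_)   # unnamed last argument acts as the default branch
}

words <- c("one", "three", "seven")

# Apply the scalar helper to each element and collect a numeric vector
numbers <- vapply(words, word_to_number, numeric(1), USE.NAMES = FALSE)
numbers
```

Returning a value from switch, rather than assigning inside each branch, keeps the helper free of side effects.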
+9
Jul 11 '16 at 12:57

You can use recode from the car package:

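    library(ggplot2) # get data (the diamonds data set)
    library(car)
    # Recode two clarity levels to 'low' and everything else to 'high'
    diamonds$new_var <- recode(diamonds$clarity, "'I1' = 'low';'SI2' = 'low';else = 'high';")
    diamonds$new_var[1:10] # show the first ten recoded values
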
+4
Jan 07 '11 at 3:16

There is a switch statement, but I can never seem to get it to work the way I think it should. Since you did not provide an example, I will make one using a factor variable:

    # stringsAsFactors = TRUE is needed in R >= 4.0 to get a factor column
    dft <- data.frame(x = sample(letters[1:8], 20, replace=TRUE), stringsAsFactors = TRUE)
    levels(dft$x)
    [1] "a" "b" "c" "d" "e" "f" "g" "h"

If you specify the categories you want in an order that matches the reassignment, you can use the factor or a numeric variable as an index:

    c("abc", "abc", "abc", "def", "def", "def", "g", "h")[dft$x]
     [1] "def" "h"   "g"   "def" "def" "abc" "h"   "h"   "def" "abc" "abc" "abc" "h"   "h"   "abc"
    [16] "def" "abc" "abc" "def" "def"

    dft$y <- c("abc", "abc", "abc", "def", "def", "def", "g", "h")[dft$x]
    str(dft)
    'data.frame':   20 obs. of  2 variables:
     $ x: Factor w/ 8 levels "a","b","c","d",..: 4 8 7 4 6 1 8 8 5 2 ...
     $ y: chr  "def" "h" "g" "def" ...

I later learned that switch really has two different behaviors. It is not a generic function, but you should think of it as either switch.numeric or switch.character. If your first argument is an R factor, you get the switch.numeric behavior, which is likely to cause problems, since most people see factors displayed as character and wrongly assume that all functions will treat them as such.
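The difference can be seen with a tiny sketch. The factor f and the labels "first" and "second" are made up for illustration; the factor has a single level, so its internal integer code is 1 even though it prints as "b".

```r
f <- factor("b")   # a single level "b", stored internally as code 1

# Character input: matched against the argument names
by_name <- switch(as.character(f), a = "first", b = "second")

# Factor input: treated as its integer code, so the 1st alternative is picked.
# R warns about this; the warning is suppressed here to keep the output clean.
by_code <- suppressWarnings(switch(f, a = "first", b = "second"))

c(by_name = by_name, by_code = by_code)
```

Wrapping the factor in as.character before calling switch avoids the surprise.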

+4
Jan 07 2018-11-11T00:

I don't like any of these; they are not clear to the reader or the potential user. I just use an anonymous function. The syntax is not as slick as a case statement, but the evaluation is similar to one and it is not that painful. This also assumes you evaluate it where your variables are defined.

    result <- ( function() {
        if (x==10 | y< 5) return('foo')
        if (x==11 & y== 5) return('bar')
    })()

All of those parentheses are necessary to enclose and then call the anonymous function.

+4
Sep 09 '11 at 20:28

case_when(), which was added to dplyr in May 2016, solves this problem in a manner similar to memisc::cases().

For example:

    library(dplyr)
    mtcars %>%
      mutate(category = case_when(
        .$cyl == 4 & .$disp < median(.$disp) ~ "4 cylinders, small displacement",
        .$cyl == 8 & .$disp > median(.$disp) ~ "8 cylinders, large displacement",
        TRUE ~ "other"
      )
    )

+4
Jan 26 '17 at 3:51

A concrete example would have helped. If the variable is a factor, which is likely, simply set the factor levels accordingly.

Say you have a factor with the letters A through E, like this.

    > a <- factor(rep(LETTERS[1:5],2))
    > a
     [1] A B C D E A B C D E
    Levels: A B C D E

To join levels B and C and call the result BC, simply change the names of those levels to BC.

    > levels(a) <- c("A","BC","BC","D","E")
    > a
     [1] A  BC BC D  E  A  BC BC D  E
    Levels: A BC D E

The result is as desired.

+1
Sep 10 2018-11-11T00:

If you want SQL-like syntax, you can just use the sqldf package. The function to use is also called sqldf, and the syntax is as follows:

 sqldf(<your query in quotation marks>) 
+1
Nov 17 '13 at 11:58

You can use the base merge function for case-style remapping tasks:

    df <- data.frame(name = c('cow','pig','eagle','pigeon','cow','eagle'),
                     stringsAsFactors = FALSE)

    mapping <- data.frame(
      name=c('cow','pig','eagle','pigeon'),
      category=c('animal','animal','bird','bird')
    )

    merge(df,mapping)
    #     name category
    # 1    cow   animal
    # 2    cow   animal
    # 3  eagle     bird
    # 4  eagle     bird
    # 5    pig   animal
    # 6 pigeon     bird

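Note that merge, with its defaults, sorts the result by the key and drops rows that have no match in the mapping. If the original row order matters, one alternative sketch uses match against the same kind of mapping table. The extra 'horse' row is made up here to show what happens to an unmapped value.

```r
df <- data.frame(name = c('cow', 'pig', 'eagle', 'pigeon', 'cow', 'eagle', 'horse'),
                 stringsAsFactors = FALSE)

mapping <- data.frame(
  name     = c('cow', 'pig', 'eagle', 'pigeon'),
  category = c('animal', 'animal', 'bird', 'bird'),
  stringsAsFactors = FALSE
)

# match() returns, for each df$name, its row position in mapping (or NA)
df$category <- mapping$category[match(df$name, mapping$name)]
df
```

The rows stay in their original order, and the unmapped 'horse' row gets NA instead of disappearing. Calling merge with all.x = TRUE would also keep that row.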
+1
Apr 15 '17 at 21:28

Mixing plyr::mutate and dplyr::case_when works for me and is readable.

    library(dplyr) # provides the %>% pipe
    iris %>%
    plyr::mutate(coolness =
         dplyr::case_when(Species  == "setosa"     ~ "not cool",
                          Species  == "versicolor" ~ "not cool",
                          Species  == "virginica"  ~ "super awesome",
                          TRUE                     ~ "undetermined"
           )) -> testIris
    head(testIris)
    levels(testIris$coolness)  ## NULL
    testIris$coolness <- as.factor(testIris$coolness)
    levels(testIris$coolness)  ## ok now
    testIris[97:103,4:6]

Bonus points if the column could come out of mutate as a factor instead of character! The last line of the case_when statement, which catches all unmatched rows, is very important.

        Petal.Width    Species      coolness
    97          1.3 versicolor      not cool
    98          1.3 versicolor      not cool
    99          1.1 versicolor      not cool
    100         1.3 versicolor      not cool
    101         2.5  virginica super awesome
    102         1.9  virginica super awesome
    103         2.1  virginica super awesome
0
03 Aug '17 at 7:59


