dplyr mutate with conditional values

In a large data frame ("myfile") with four columns, I need to add a fifth column whose values depend conditionally on the first four columns.

I would prefer answers that use dplyr and mutate, mainly because of their speed on large datasets.

My data frame looks like this:

  V1 V2 V3 V4
1  1  2  3  5
2  2  4  4  1
3  1  4  1  1
4  4  5  1  3
5  5  5  5  4
...

The values of the fifth column (V5) are based on some conditional rules:

 if (V1==1 & V2!=4) {
   V5 <- 1
 } else if (V2==4 & V3!=1) {
   V5 <- 2
 } else {
   V5 <- 0
 }

Now I want to use the mutate function to apply these rules to all rows (to avoid slow loops). Something like this (and yes, I know it doesn't work like that!):

 myfile <- mutate(myfile, if (V1==1 & V2!=4){V5 = 1} else if (V2==4 & V3!=1){V5 = 2} else {V5 = 0}) 

This should be the result:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0

How can I do this in dplyr?

+63
r dplyr mutate
Mar 11 '14 at 21:48
3 answers

Try this:

 myfile %>% mutate(V5 = (V1 == 1 & V2 != 4) + 2 * (V2 == 4 & V3 != 1)) 

giving:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0

or this:

 myfile %>% mutate(V5 = ifelse(V1 == 1 & V2 != 4, 1, ifelse(V2 == 4 & V3 != 1, 2, 0))) 

giving:

  V1 V2 V3 V4 V5
1  1  2  3  5  1
2  2  4  4  1  2
3  1  4  1  1  0
4  4  5  1  3  0
5  5  5  5  4  0
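The arithmetic form works because R coerces TRUE/FALSE to 1/0 in arithmetic, and because the two rules can never hold on the same row (one requires V2 != 4, the other V2 == 4). A minimal base-R sketch of that reasoning on the sample input, with the helper names rule1, rule2, v5_arith and v5_ifelse chosen here purely for illustration:

```r
# Sample input copied from the answer (only V1..V3 are needed by the rules).
V1 <- c(1L, 2L, 1L, 4L, 5L)
V2 <- c(2L, 4L, 4L, 5L, 5L)
V3 <- c(3L, 4L, 1L, 1L, 5L)

rule1 <- V1 == 1 & V2 != 4   # logical vector, one entry per row
rule2 <- V2 == 4 & V3 != 1   # logical vector, one entry per row

# The rules are mutually exclusive, so the sum below can only be 0, 1 or 2.
stopifnot(!any(rule1 & rule2))

# Arithmetic coerces TRUE/FALSE to 1/0.
v5_arith <- rule1 + 2 * rule2

# Nested ifelse() spells out the same if / else if / else chain.
v5_ifelse <- ifelse(rule1, 1, ifelse(rule2, 2, 0))

print(v5_arith)
print(v5_ifelse)
```

If the rules could overlap, the arithmetic form would add the two codes together, whereas the nested ifelse would still return the first match, so the ifelse form is the safer one to generalize.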

Note

I suggest you choose a better name for your data frame. myfile gives the impression that it holds a file name.

The above used this input:

 myfile <- structure(list(V1 = c(1L, 2L, 1L, 4L, 5L),
                          V2 = c(2L, 4L, 4L, 5L, 5L),
                          V3 = c(3L, 4L, 1L, 1L, 5L),
                          V4 = c(5L, 1L, 1L, 3L, 4L)),
                     .Names = c("V1", "V2", "V3", "V4"),
                     class = "data.frame",
                     row.names = c("1", "2", "3", "4", "5"))

Update 1: Since this answer was originally posted, dplyr has changed %.% to %>%, so the answer has been updated accordingly.

Update 2: dplyr now has case_when, which provides another solution:

 myfile %>% mutate(V5 = case_when(V1 == 1 & V2 != 4 ~ 1, V2 == 4 & V3 != 1 ~ 2, TRUE ~ 0)) 
+79
Mar 11 '14 at 21:52

With dplyr 0.7.2 you can use the very useful case_when function:

 x = read.table(text = "V1 V2 V3 V4
 1 1 2 3 5
 2 2 4 4 1
 3 1 4 1 1
 4 4 5 1 3
 5 5 5 5 4")
 x$V5 = case_when(x$V1==1 & x$V2!=4 ~ 1,
                  x$V2==4 & x$V3!=1 ~ 2,
                  TRUE ~ 0)

Expressed with dplyr::mutate, this becomes:

 x = x %>% mutate(
   V5 = case_when(
     V1==1 & V2!=4 ~ 1,
     V2==4 & V3!=1 ~ 2,
     TRUE ~ 0
   )
 )

Please note that NA values are not treated specially, which can be misleading. The function returns NA only when no condition is matched. If you include a line with TRUE ~ ..., as I did in my example, the return value will never be NA.

Therefore, you have to explicitly tell case_when to put NA where it belongs by adding a statement like is.na(x$V1) | is.na(x$V3) ~ NA_integer_. Hint: dplyr::coalesce() can sometimes be very useful here!

Also, note that NA alone will usually not work; you have to use the typed NA values: NA_integer_, NA_character_ or NA_real_.
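To make the NA behaviour concrete, here is a minimal sketch on a small made-up data frame (the data and the names without_na and with_na are hypothetical, not from the question). It assumes the NA clause is placed first so that it takes precedence, and it uses NA_real_ because the other branches (1, 2, 0) are doubles; older dplyr versions are strict about all branches having the same type.

```r
library(dplyr)

# Hypothetical toy data with one missing V1 value (not the question's data).
x <- data.frame(V1 = c(1, NA, 2), V2 = c(2, 2, 4), V3 = c(3, 3, 4))

# No NA clause: the comparison in row 2 evaluates to NA, which counts as
# "not matched", so the row falls through to the TRUE ~ 0 catch-all.
without_na <- x %>%
  mutate(V5 = case_when(
    V1 == 1 & V2 != 4 ~ 1,
    V2 == 4 & V3 != 1 ~ 2,
    TRUE ~ 0
  ))

# NA clause first: clauses are evaluated in order, so row 2 becomes NA.
with_na <- x %>%
  mutate(V5 = case_when(
    is.na(V1) | is.na(V3) ~ NA_real_,
    V1 == 1 & V2 != 4 ~ 1,
    V2 == 4 & V3 != 1 ~ 2,
    TRUE ~ 0
  ))

print(without_na$V5)
print(with_na$V5)
```

Whether a missing input should yield NA or the default value is a design decision; the point is only that case_when will not make that choice for you.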

+18
Jul 31 '17 at 13:17

It looks like the derivedFactor from the mosaic package was designed for this. In this example, it looks something like this:

 library(mosaic)
 myfile <- mutate(myfile, V5 = derivedFactor(
   "1" = (V1==1 & V2!=4),
   "2" = (V2==4 & V3!=1),
   .method = "first",
   .default = 0
 ))

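(If you want the result to be numeric instead of a factor, convert the derivedFactor result with as.numeric(as.character(...)). Calling as.numeric directly on a factor returns the internal level codes, not the values shown in the labels.)

Note that the .default argument combined with .method = "first" provides the "else" condition; this approach is described in the help file for derivedFactor.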
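When converting a factor result to numbers, it is worth checking whether you get the labels or the internal level codes. A small base-R sketch, using a hand-built factor as a stand-in for the derivedFactor output (the level order here is an assumption for illustration, not taken from mosaic):

```r
# Hypothetical factor standing in for the V5 column; level order is assumed.
v5 <- factor(c("1", "2", "0", "0", "0"), levels = c("1", "2", "0"))

# as.numeric() on a factor returns the integer level codes.
codes <- as.numeric(v5)

# Going through as.character() recovers the numbers shown in the labels.
values <- as.numeric(as.character(v5))

print(codes)
print(values)
```

Level codes start at 1, so a label of "0" can never be recovered by calling as.numeric on the factor directly.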

+11
Oct 22 '15 at 20:14


