How to flag the first change in the value of a variable between years, for each group?

Given a very large longitudinal data set with different groups, I need to create a flag that indicates the first change in a specific variable (code) between years (year), for each group (id). The type column within the same id-year only distinguishes different members of the group.

Sample data:

library(tidyverse)
sample <- tibble(id = rep(1:3, each = 6),
                 year = rep(2010:2012, 3, each = 2),
                 type = rep(1:2, 9),
                 code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","","klm","nop","nop"))

I need to indicate the first change in code within the group, between years. The second change does not matter. Missing codes ("") can be treated as NA, but in any case they should not affect the flag. Below is the data with the flag column, as it should look:

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1           1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1

Any suggestions on how to do this, preferably with dplyr, would be much appreciated. Thanks!

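EDIT: Only changes between years matter. Within the same id and year, different types can carry different codes: for example, row 15 has code "", but row 16 has a new code, so both rows are flagged 1.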


A dplyr approach, with the steps explained in the comments:

sample %>% 
  group_by(id) %>% 
  #find first year per group where code exists
  mutate(first_year = min(year[code != ""])) %>% 
  #gather all codes from first year (does not assume code is constant within year)
  mutate(first_codes = list(code[year==first_year])) %>% 
  #if year is not first year & code not in first year codes & code not blank
  mutate(flag = as.numeric(year != first_year & !(code %in% unlist(first_codes)) & code != "")) %>% 
  #drop created columns
  select(-first_year, -first_codes) %>% 
  ungroup()

# A tibble: 18 × 5
      id  year  type  code  flag
   <int> <int> <int> <chr> <dbl>
1      1  2010     1   abc     0
2      1  2010     2   abc     0
3      1  2011     1           0
4      1  2011     2           0
5      1  2012     1   xyz     1
6      1  2012     2   xyz     1
7      2  2010     1           0
8      2  2010     2           0
9      2  2011     1   lmn     0
10     2  2011     2           0
11     2  2012     1   efg     1
12     2  2012     2   efg     1
13     3  2010     1   def     0
14     3  2010     2   def     0
15     3  2011     1   klm     1
16     3  2011     2   klm     1
17     3  2012     1   nop     1
18     3  2012     2   nop     1
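Note that with the sample data from the question, row 15 has an empty code, so the `code != ""` condition above leaves that row at 0, while the question expects 1 for the whole id-year. Below is a minimal sketch of one way to spread the flag to every row of the same id and year. The names `dat`, `row_flag` and `flagged` are made up for this illustration, and the logic still compares against the codes of the first non-empty year only, so it assumes every id has at least one non-empty code and that a code never reverts to a first-year code.

```r
library(dplyr)

# Question's sample data; named `dat` here (hypothetical name) to avoid masking base::sample
dat <- data.frame(
  id   = rep(1:3, each = 6),
  year = rep(2010:2012, 3, each = 2),
  type = rep(1:2, 9),
  code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg",
           "def","def","","klm","nop","nop"),
  stringsAsFactors = FALSE
)

flagged <- dat %>%
  group_by(id) %>%
  mutate(
    # first year with a non-empty code in this id
    first_year = min(year[code != ""]),
    # row-level test: later year, non-empty code, code not seen in the first year
    row_flag = year != first_year & code != "" &
      !(code %in% code[year == first_year])
  ) %>%
  # spread the flag to all rows of the same id-year
  group_by(id, year) %>%
  mutate(flag = as.numeric(any(row_flag))) %>%
  ungroup() %>%
  select(-first_year, -row_flag)
```

On this data the flag column of `flagged` matches the expected table in the question, including row 15.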

An option with data.table:

library(data.table)
setDT(sample)[, flag := 0][code != "", flag := {rl <- rleid(code) - 1; cummax(rl * (rl < 2))}, by = id]
sample
#    id year type code flag
# 1:  1 2010    1  abc    0
# 2:  1 2010    2  abc    0
# 3:  1 2011    1         0
# 4:  1 2011    2         0
# 5:  1 2012    1  xyz    1
# 6:  1 2012    2  xyz    1
# 7:  2 2010    1         0
# 8:  2 2010    2         0
# 9:  2 2011    1  lmn    0
#10:  2 2011    2         0
#11:  2 2012    1  efg    1
#12:  2 2012    2  efg    1
#13:  3 2010    1  def    0
#14:  3 2010    2  def    0
#15:  3 2011    1  klm    1
#16:  3 2011    2  klm    1
#17:  3 2012    1  nop    1
#18:  3 2012    2  nop    1
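To make the `rleid`/`cummax` step easier to follow, here is a small sketch that applies the same expression to the non-empty codes of a single group (id 3). The vector `codes` and the intermediate names are only for illustration.

```r
library(data.table)

# Non-empty codes of one group, in row order
codes <- c("def", "def", "klm", "klm", "nop", "nop")

rl <- rleid(codes) - 1       # run index starting at 0: 0 0 1 1 2 2
first_only <- rl * (rl < 2)  # keep only the first change:  0 0 1 1 0 0
flag <- cummax(first_only)   # carry the 1 forward:         0 0 1 1 1 1
```

The `rl < 2` mask zeroes out every run after the first change, and `cummax` then keeps the flag at 1 for all later rows, so the result never exceeds 1.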

Update

To take year into account as well (rows with code "" are still skipped):

setDT(sample)[, flag :=0][code!="",  flag := {rl <- rleid(code, year)-1
                   cummax(rl*(rl < 2)) }, id]

A short solution with the data.table package:

library(data.table)
setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code)-1 > 0), by = id]

Or:

setDT(samp)[, flag := 0][code!="", flag := 1*(code!=code[1] & code!=''), by = id][]

which gives the desired result:

> samp
    id year type code flag
 1:  1 2010    1  abc    0
 2:  1 2010    2  abc    0
 3:  1 2011    1         0
 4:  1 2011    2         0
 5:  1 2012    1  xyz    1
 6:  1 2012    2  xyz    1
 7:  2 2010    1         0
 8:  2 2010    2         0
 9:  2 2011    1  lmn    0
10:  2 2011    2         0
11:  2 2012    1  efg    1
12:  2 2012    2  efg    1
13:  3 2010    1  def    0
14:  3 2010    2  def    0
15:  3 2011    1  klm    1
16:  3 2011    2  klm    1
17:  3 2012    1  nop    1
18:  3 2012    2  nop    1

Or, when the year also matters:

setDT(samp)[, flag := 0][code!="", flag := 1*(rleid(code, year)-1 > 0), id]
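A short sketch of how the two `rleid` calls differ, using made-up vectors `code` and `year` in which the same code is repeated in a later year:

```r
library(data.table)

code <- c("abc", "abc", "abc", "abc", "xyz", "xyz")
year <- c(2010, 2010, 2011, 2011, 2012, 2012)

by_code      <- rleid(code)        # new run only when code changes: 1 1 1 1 2 2
by_code_year <- rleid(code, year)  # new run when code OR year changes: 1 1 2 2 3 3

flag_code      <- 1 * (by_code - 1 > 0)       # 0 0 0 0 1 1
flag_code_year <- 1 * (by_code_year - 1 > 0)  # 0 0 1 1 1 1
```

With `rleid(code, year)`, an unchanged code in a new year also starts a new run and is therefore flagged. Whether that is wanted depends on how "change between years" is defined for the data at hand.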

A possible alternative in base R:

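# Run index within a group turned into a 0/1 flag: 0 for the first run, 1 afterwards
f <- function(x) {
  x <- rle(x)$lengths
  1 * (rep(seq_along(x), times = x) - 1 > 0)
}

samp$flag <- 0
nonblank <- samp$code != ''
# The trailing comma subsets rows; ave() returns character here, hence as.numeric()
samp$flag[nonblank] <- as.numeric(with(samp[nonblank, ], ave(as.character(code), id, FUN = f)))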

NOTE: It is better not to give an object the same name as a function. sample is a base R function, which is why the data is called samp here.

Data used:

samp <- data.frame(id = rep(1:3, each=6),
                   year = rep(2010:2012, 3, each=2),
                   type = (rep(1:2, 9)),
                   code = c("abc","abc","","","xyz","xyz", "","","lmn","","efg","efg","def","def","klm","klm","nop","nop"))

Source: https://habr.com/ru/post/1676834/

