Create monthly data and expand data

I have a data frame, and I want to create an unbalanced panel based on the following data set.

    profile <- c('lehman', 'john', 'oliver', 'stephen', 'picasso')
    start_date <- c('2008-01-01', '2008-02-02', '2008-04-02', '2008-09-02', '2009-02-02')
    end_date <- c('2009-12-31', '2009-12-31', '2009-12-31', '2009-12-31', '2009-12-31')
    df <- data.frame(profile, start_date, end_date)

I would like to create two columns, tid and myear. myear is the month, starting from the start date and continuing month by month until the end date. tid is a month index, encoded as 01 for myear 01-2008, 02 for 02-2008, and so on, so that 12-2009 is 24. Can anyone suggest how this can be done? Here is the expected result.

    profile  start_date  end_date    tid  myear
    lehman   2008-01-01  2009-12-31  01   01-2008
    lehman   2008-01-01  2009-12-31  02   02-2008
    ...
    lehman   2008-01-01  2009-12-31  24   12-2009
    john     2008-02-02  2009-12-31  02   02-2008
    john     2008-02-02  2009-12-31  03   03-2008
    ...
    john     2008-02-02  2009-12-31  24   12-2009
    ...
    picasso  2009-02-02  2009-12-31  14   02-2009
    picasso  2009-02-02  2009-12-31  15   03-2009
    ...
4 answers

Here is an idea. First make sure your dates are of class Date (i.e. df[2:3] <- lapply(df[2:3], function(i) as.Date(i, format = '%Y-%m-%d')) ). Then create a list with a monthly sequence between the start and end dates for each row. Count the lengths of the list elements and use them to expand your data frame. Add the date sequence as a new column, and create tid by looking up each myear in the first profile's sequence (this relies on the first row covering the full date range, as lehman does here).

    seq_lst <- lapply(Map(function(x, y) seq(x, y, by = 'months'), df$start_date, df$end_date),
                      function(i) format(i, '%m-%Y'))
    df <- df[rep(seq_len(nrow(df)), lengths(seq_lst)), ]
    df$myear <- unlist(seq_lst)
    i1 <- setNames(seq(length(seq_lst[[1]])), seq_lst[[1]])
    df$tid <- sprintf('%02d', i1[match(df$myear, names(i1))])

    head(df)
    #    profile start_date   end_date   myear tid
    #1    lehman 2008-01-01 2009-12-31 01-2008  01
    #1.1  lehman 2008-01-01 2009-12-31 02-2008  02
    #1.2  lehman 2008-01-01 2009-12-31 03-2008  03
    #1.3  lehman 2008-01-01 2009-12-31 04-2008  04
    #1.4  lehman 2008-01-01 2009-12-31 05-2008  05
    #1.5  lehman 2008-01-01 2009-12-31 06-2008  06
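The lookup above takes its month labels from the first profile, so it works because lehman's row covers the whole range. As a minimal sketch of an alternative (not part of the original answer), tid can be computed arithmetically from the earliest start_date, which does not depend on row order. The helper month_index and the name panel are made up for illustration, and the toy data mirrors the question. Each sequence is anchored at the first of the month as an extra precaution, since stepping by month from a day late in the month can spill into the following month.

```r
# Toy data mirroring the question, with dates already of class Date
df <- data.frame(
  profile    = c("lehman", "john", "oliver", "stephen", "picasso"),
  start_date = as.Date(c("2008-01-01", "2008-02-02", "2008-04-02",
                         "2008-09-02", "2009-02-02")),
  end_date   = as.Date("2009-12-31"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: absolute month count; only differences are used
month_index <- function(d) {
  lt <- as.POSIXlt(d)
  (lt$year + 1900) * 12 + lt$mon  # lt$mon is 0-based
}

# One monthly sequence per row, anchored at the first day of the start month
seq_lst <- Map(function(s, e) seq(as.Date(format(s, "%Y-%m-01")), e, by = "month"),
               df$start_date, df$end_date)

# Expand the rows, then attach the month label and the month index
panel <- df[rep(seq_len(nrow(df)), lengths(seq_lst)), ]
months <- do.call(c, seq_lst)
panel$myear <- format(months, "%m-%Y")
panel$tid <- sprintf("%02d", month_index(months) - month_index(min(df$start_date)) + 1)
rownames(panel) <- NULL
```

With this approach tid keeps counting past 24 if some end_date is later, and the rows of df can be in any order.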

Here is another possible way to achieve this. I followed your sample data: every name in profile has the same end_date, December 31, 2009, and the earliest start_date is January 1, 2008. The code below relies on both of these assumptions, so it will not give correct results if your data differs from the sample in either respect.

I created the date sequences with do(). Because I used group_by(), start_date and end_date are repeated to match the length of myear. For each profile I built a monthly sequence of dates and converted it to the format you specified, month and year (for example, 01-2008), so myear is a character column. After that I created tid. The final number is 24 for every profile, so simple arithmetic is enough: you only need to know how many rows each profile has. Take picasso. Its start_date is in February 2009, which is the 14th month counting from January 2008. picasso therefore has 11 rows, that is, n() = 11, and (1 + (24 - 11)):24 produces a sequence starting at 14 and ending at 24. Part of the output is shown below.

    library(dplyr)

    group_by(df, profile) %>%
      do(data.frame(start_date = .$start_date,
                    end_date = .$end_date,
                    myear = format(seq(from = .$start_date, to = .$end_date, by = "months"), "%m-%Y"))) %>%
      mutate(tid = (1 + (24 - n())):24)

    #69 picasso 2009-02-02 2009-12-31 02-2009 14
    #70 picasso 2009-02-02 2009-12-31 03-2009 15
    #71 picasso 2009-02-02 2009-12-31 04-2009 16
    #72 picasso 2009-02-02 2009-12-31 05-2009 17
    #73 picasso 2009-02-02 2009-12-31 06-2009 18
    #74 picasso 2009-02-02 2009-12-31 07-2009 19
    #75 picasso 2009-02-02 2009-12-31 08-2009 20
    #76 picasso 2009-02-02 2009-12-31 09-2009 21
    #77 picasso 2009-02-02 2009-12-31 10-2009 22
    #78 picasso 2009-02-02 2009-12-31 11-2009 23
    #79 picasso 2009-02-02 2009-12-31 12-2009 24
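To make the arithmetic behind (1 + (24 - n())):24 concrete, here is a small base R sketch that works out the first tid for every profile in the sample data. The variable names (starts, n_rows, first_tid, picasso_tid) are chosen for illustration only, and the constant 24 rests on the same assumptions as the answer: a shared end date of December 2009 and an earliest start of January 2008.

```r
profiles <- c("lehman", "john", "oliver", "stephen", "picasso")
starts <- as.Date(c("2008-01-01", "2008-02-02", "2008-04-02",
                    "2008-09-02", "2009-02-02"))
end <- as.Date("2009-12-31")

# Number of monthly rows each profile gets, i.e. what n() returns per group
n_rows <- vapply(seq_along(starts),
                 function(i) length(seq(starts[i], end, by = "month")),
                 integer(1))
names(n_rows) <- profiles

total_months <- 24                        # January 2008 through December 2009
first_tid <- 1 + (total_months - n_rows)  # month index of each profile's first row
first_tid
# lehman 1, john 2, oliver 4, stephen 9, picasso 14

# The full tid sequence for picasso runs from 14 to 24
picasso_tid <- first_tid[["picasso"]]:total_months
```

Each first tid matches the position of the profile's start month counted from January 2008, which is why the shortcut works for this data.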

DATA

    structure(list(profile = structure(c(2L, 1L, 3L, 5L, 4L),
                                       .Label = c("john", "lehman", "oliver", "picasso", "stephen"),
                                       class = "factor"),
                   start_date = structure(c(1199113200, 1201878000, 1207062000, 1220281200, 1233500400),
                                          class = c("POSIXct", "POSIXt"), tzone = ""),
                   end_date = structure(c(1262185200, 1262185200, 1262185200, 1262185200, 1262185200),
                                        class = c("POSIXct", "POSIXt"), tzone = "")),
              .Names = c("profile", "start_date", "end_date"),
              row.names = c(NA, -5L), class = "data.frame")

This solution is based on functions from tidyverse, lubridate and stringr.

Update

I misunderstood the definition of tid at first. The code now calculates tid as expected: tid is a running month index that starts at the earliest month of the earliest year in the data, and myear combines the month and year.

    library(tidyverse)
    library(lubridate)
    library(stringr)

    df2 <- df %>%
      # Parse the dates and split them into year and month parts
      mutate(start_date = ymd(start_date), end_date = ymd(end_date)) %>%
      mutate(start_year = year(start_date), end_year = year(end_date),
             start_month = month(start_date), end_month = month(end_date)) %>%
      # One row per profile and year
      mutate(Year = map2(start_year, end_year, `:`)) %>%
      unnest(Year) %>%
      group_by(profile) %>%
      mutate(first_year = ifelse(Year == min(Year), TRUE, FALSE),
             last_year = ifelse(Year == max(Year), TRUE, FALSE)) %>%
      mutate(start_month = ifelse(!first_year, 1, start_month),
             end_month = ifelse(!last_year, 12, end_month)) %>%
      # One row per profile, year and month
      mutate(Month = map2(start_month, end_month, `:`)) %>%
      unnest(Month) %>%
      # Running month index within each profile
      mutate(endid = n() + Month - 1) %>%
      mutate(tid = first(Month):first(endid)) %>%
      mutate(Multiple_Year = ifelse(length(unique(Year)) > 1, TRUE, FALSE)) %>%
      ungroup() %>%
      # Shift profiles that start after the earliest year
      mutate(tid = ifelse(Year > min(Year) & !Multiple_Year, tid + 12 * (Year - min(Year)), tid)) %>%
      # Zero-pad and build the month-year label
      mutate(tid = str_pad(tid, width = 2, pad = "0")) %>%
      mutate(Month = str_pad(Month, width = 2, pad = "0")) %>%
      mutate(myear = paste(Month, Year, sep = "-")) %>%
      select(profile, start_date, end_date, tid, myear)

Output

Now look at some of the df2 output to see if the code is working as expected.

The first two lines of lehman

    df2 %>% filter(profile %in% "lehman") %>% head(2)
    # A tibble: 2 x 5
      profile start_date   end_date   tid   myear
       <fctr>     <date>     <date> <chr>   <chr>
    1  lehman 2008-01-01 2009-12-31    01 01-2008
    2  lehman 2008-01-01 2009-12-31    02 02-2008

The last line of lehman

    df2 %>% filter(profile %in% "lehman") %>% tail(1)
    # A tibble: 1 x 5
      profile start_date   end_date   tid   myear
       <fctr>     <date>     <date> <chr>   <chr>
    1  lehman 2008-01-01 2009-12-31    24 12-2009

The first two lines of Picasso

    df2 %>% filter(profile %in% "picasso") %>% head(2)
    # A tibble: 2 x 5
      profile start_date   end_date   tid   myear
       <fctr>     <date>     <date> <chr>   <chr>
    1 picasso 2009-02-02 2009-12-31    14 02-2009
    2 picasso 2009-02-02 2009-12-31    15 03-2009

Data preparation

    profile <- c('lehman', 'john', 'oliver', 'stephen', 'picasso')
    start_date <- c("2008-01-01", "2008-02-02", "2008-04-02", "2008-09-02", "2009-02-02")
    end_date <- c("2009-12-31", "2009-12-31", "2009-12-31", "2009-12-31", "2009-12-31")
    df <- data.frame(profile, start_date, end_date)
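Quoting the dates in this data preparation matters. As a small illustration (the names unquoted and quoted are made up for this sketch): without quotes, R reads 2008-01-01 as the subtraction 2008 - 1 - 1, whereas quoted strings can be converted to Date.

```r
# Unquoted: each element is evaluated as arithmetic, giving plain numbers
unquoted <- c(2008-01-01, 2008-02-02)
unquoted         # 2006 2004
class(unquoted)  # "numeric"

# Quoted: character strings that as.Date() can parse
quoted <- as.Date(c("2008-01-01", "2008-02-02"))
class(quoted)    # "Date"
```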

I know you accepted the answer, but for completeness the data.table method also works:

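    library(data.table)

    dt <- as.data.table(df)
    # seq() needs Date inputs, so convert the two date columns first
    dt[, c("start_date", "end_date") := lapply(.SD, function(x) as.Date(as.character(x))),
       .SDcols = c("start_date", "end_date")]
    # One row per profile and month
    dt.l <- dt[, list(myear = seq(start_date, end_date, by = "1 month")), by = profile]
    # Months after 2008 continue counting from 13 (assumes the data spans 2008-2009 only)
    dt.l[, tid := ifelse(year(myear) > 2008, month(myear) + 12, month(myear))]
    # Month-year label, as requested in the question
    dt.l[, myear := format(myear, "%m-%Y")]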

Source: https://habr.com/ru/post/1268808/

