Simulating time series in dplyr instead of using a for loop

So while lag and lead in dplyr are great, I want to simulate a time series of something like population growth. My old-school code would look something like this:

    tdf <- data.frame(time=1:5, pop=50)
    for(i in 2:5){
      tdf$pop[i] = 1.1*tdf$pop[i-1]
    }

which produces

      time    pop
    1    1 50.000
    2    2 55.000
    3    3 60.500
    4    4 66.550
    5    5 73.205

I feel there must be a way to do this in dplyr or the tidyverse (as much as I like my for loop).

But something like

    tdf <- data.frame(time=1:5, pop=50) %>%
      mutate(pop = 1.1*lag(pop))

which would be my first guess, just produces

      time pop
    1    1  NA
    2    2  55
    3    3  55
    4    4  55
    5    5  55

I feel like I'm missing something obvious ... what is it?

Note. This is a trivial example. My actual examples use several parameters, many of which change over time (I model forecasts in different GCM scenarios), so tidyverse is a powerful tool for bringing my simulations together.

5 answers

Reduce (or its purrr variants, if you prefer) is what you want for cumulative functions that don't yet have a cum* version written:

    data.frame(time = 1:5, pop = 50) %>%
      mutate(pop = Reduce(function(x, y){x * 1.1}, pop, accumulate = TRUE))
    ##   time    pop
    ## 1    1 50.000
    ## 2    2 55.000
    ## 3    3 60.500
    ## 4    4 66.550
    ## 5    5 73.205

or with purrr,

    data.frame(time = 1:5, pop = 50) %>%
      mutate(pop = accumulate(pop, ~.x * 1.1))
    ##   time    pop
    ## 1    1 50.000
    ## 2    2 55.000
    ## 3    3 60.500
    ## 4    4 66.550
    ## 5    5 73.205
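
When the growth rate changes from step to step, the same idea should still apply: fold over the vector of rates rather than over pop. The following is a minimal sketch in base R, assuming a hypothetical `rates` vector (not part of the question) in which each entry is the multiplier that takes the population from one time step to the next:

```r
# Hypothetical per-step multipliers (illustrative values only):
# rates[i] takes the population from time i to time i + 1.
rates <- c(1.1, 1.2, 1.05, 1.1)
pop0 <- 50

# `init` seeds the accumulator, so the result has
# length(rates) + 1 elements, starting with pop0.
pop <- Reduce(function(prev, r) prev * r, rates,
              init = pop0, accumulate = TRUE)

sim <- data.frame(time = seq_along(pop), pop = pop)
print(sim)
```

With purrr, `accumulate(rates, ~ .x * .y, .init = pop0)` should produce the same vector.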

If the initial value of pop is, say, 50, then pop = 50 * 1.1^(0:4) gives you the first five values. With your code, you can do:

    data.frame(time=1:5, pop=50) %>%
      mutate(pop = pop * 1.1^(1:n() - 1))

Or,

    base = 50
    data.frame(time=1:5) %>%
      mutate(pop = base * 1.1^(1:n()-1))
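
The closed form works here only because the growth rate is constant: the population at time t is 50 * 1.1^(t - 1). As a quick sanity check (a sketch, with variable names chosen for illustration), the step-by-step loop and the closed form can be compared directly:

```r
# Step-by-step simulation, as in the question's for loop.
pop_loop <- numeric(5)
pop_loop[1] <- 50
for (i in 2:5) {
  pop_loop[i] <- 1.1 * pop_loop[i - 1]
}

# Closed form: the population at time t is 50 * 1.1^(t - 1).
pop_closed <- 50 * 1.1^(0:4)

# Compare with a tolerance, since the two are computed
# differently in floating point.
print(isTRUE(all.equal(pop_loop, pop_closed)))
```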

The purrr function accumulate() can handle time-varying parameters if you pass them to your simulation function as a list with all the parameters in it. However, it takes a bit of wrangling to work properly. The trick here is that accumulate() can work on list columns as well as vector columns. You can use the tidyr function nest() to group columns into a list column containing the current state and the population parameters, and then use accumulate() on the resulting list column. This is a little difficult to explain, so I have included a demo simulating logistic growth with either a constant growth rate or a stochastic growth rate that changes over time. I have also included an example of using this method to simulate multiple replicates of a given model using dplyr + purrr + tidyr.

    library(dplyr)
    library(purrr)
    library(ggplot2)
    library(tidyr)

    # Declare the population growth function. Note: the first two arguments
    # have to be .x (the prior vector of populations and parameters) and .y,
    # the current parameter value and population vector.
    # This example function is a Ricker population growth model.
    logistic_growth = function(.x, .y, growth, comp) {
      pop = .x$pop[1]
      growth = .y$growth[1]
      comp = .y$comp[1]
      # Note: this uses the state from .x, and the parameter values from .y.
      # The first observation will use the first entry in the vector for .x and .y
      new_pop = pop*exp(growth - pop*comp)
      .y$pop[1] = new_pop
      return(.y)
    }

    # Starting parameters: the number of time steps to simulate, initial population size,
    # and ecological parameters (growth rate and intraspecific competition rate)
    n_steps = 100
    pop_init = 1
    growth = 0.5
    comp = 0.05

    # First test: fixed growth rates
    test1 = data_frame(time = 1:n_steps, pop = pop_init,
                       growth = growth, comp = comp)

    # Here, the combination of nest() and group_by() splits the data into individual
    # time points and then groups all parameters into a new vector called state.
    # ungroup() removes the grouping structure, then accumulate() runs the function
    # on the vector of states. Finally, unnest() transforms it all back to a
    # data frame.
    out1 = test1 %>%
      group_by(time) %>%
      nest(pop, growth, comp, .key = state) %>%
      ungroup() %>%
      mutate(state = accumulate(state, logistic_growth)) %>%
      unnest()

    # This is the same example, except I drew the growth rates from a normal distribution
    # with a mean equal to the mean growth rate and a std. dev. of 0.1
    test2 = data_frame(time = 1:n_steps, pop = pop_init,
                       growth = rnorm(n_steps, growth, 0.1), comp = comp)

    out2 = test2 %>%
      group_by(time) %>%
      nest(pop, growth, comp, .key = state) %>%
      ungroup() %>%
      mutate(state = accumulate(state, logistic_growth)) %>%
      unnest()

    # This demonstrates how to use this approach to simulate replicates using dplyr.
    # Note: the crossing() function creates all combinations of its input values.
    test3 = crossing(rep = 1:10, time = 1:n_steps, pop = pop_init, comp = comp) %>%
      mutate(growth = rnorm(n_steps*10, growth, 0.1))

    out3 = test3 %>%
      group_by(rep) %>%
      group_by(rep, time) %>%
      nest(pop, growth, comp, .key = state) %>%
      group_by(rep) %>%
      mutate(state = accumulate(state, logistic_growth)) %>%
      unnest()

    print(qplot(time, pop, data = out1) +
            geom_line() +
            geom_point(data = out2, col = "red") +
            geom_line(data = out2, col = "red") +
            geom_point(data = out3, col = "red", alpha = 0.1) +
            geom_line(data = out3, col = "red", alpha = 0.1, aes(group = rep)))
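
The core of this approach is a fold in which each step sees the previous population and the current row's parameters. As a simplified sketch of that idea, the version below swaps nest() + accumulate() for base R's Reduce() over row indices. The `params` data frame, its values, and the helper `ricker_step` are hypothetical names introduced only for illustration:

```r
# Hypothetical per-step parameters (illustrative values only).
params <- data.frame(
  time   = 1:5,
  growth = c(0.5, 0.4, 0.6, 0.5, 0.45),
  comp   = 0.05
)
pop_init <- 1

# One Ricker step: combines the previous population with the
# parameters stored in row i.
ricker_step <- function(pop, i) {
  pop * exp(params$growth[i] - pop * params$comp[i])
}

# Fold over rows 2..n; `init` supplies the population at time 1,
# so the result has one value per row of params.
params$pop <- Reduce(ricker_step, seq_len(nrow(params))[-1],
                     init = pop_init, accumulate = TRUE)
print(params)
```

Folding over row indices keeps every parameter column available to the step function without reshaping the data, at the cost of the function reaching into `params` from the enclosing environment.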

What about map functions, i.e.

    tdf <- data_frame(time=1:5)
    tdf %>%
      # use x - 1 as the exponent so that the population at time 1 is 50
      mutate(pop = map_dbl(.x = time, .f = function(x) 50*1.1^(x - 1)))

The problem is that dplyr does this as a set of vector operations, rather than evaluating the expression one row at a time. Here 1.1*lag(pop) is interpreted as "compute the lagged values of pop for all rows, then multiply them all by 1.1". Since you set pop=50 in every row, the lagged value at every step is 50 (and NA in the first row, which has no predecessor).
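
A small sketch of what that means in practice, using a plain vector in place of the data frame column:

```r
library(dplyr)

pop <- rep(50, 5)

# lag() shifts the existing vector once; it never sees values
# computed earlier in the same expression.
lagged <- lag(pop)
print(lagged)        # NA 50 50 50 50
print(1.1 * lagged)  # NA 55 55 55 55
```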

dplyr has some helper functions for sequential evaluation; the standard functions cumsum, cumprod, etc. work, and several new ones (see ?cummean) work within dplyr. In your example, you can simulate the model with:

    tdf <- data.frame(time=1:5, pop=50, growth_rate = c(1, rep(1.1,times=4))) %>%
      mutate(pop = pop*cumprod(growth_rate))

    time    pop growth_rate
       1 50.000         1.0
       2 55.000         1.1
       3 60.500         1.1
       4 66.550         1.1
       5 73.205         1.1

Notice that I added the growth rate as a column here, and I set the first growth rate to 1. You can also specify it as follows:

    tdf <- data.frame(time=1:5, pop=50, growth_rate = 1.1) %>%
      mutate(pop = pop*cumprod(lag(growth_rate, default=1)))

This makes it explicit that the multiplier applied at each time step is the growth rate from the previous row, with no growth applied at the first step.

There are limits to the kinds of simulations you can do this way, but it should be possible to build many discrete-time environmental models using some combination of the cumulative functions and parameters stored in columns.
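
For instance, a growth rate that differs at every step can be stored in a column and folded in with cumprod(). The sketch below uses hypothetical rates (illustrative values only) and checks the result against an explicit loop:

```r
library(dplyr)

# Hypothetical rates: growth_rate in row i is the multiplier applied
# when moving from time i to time i + 1 (the last entry is unused).
sim <- data.frame(time = 1:5,
                  growth_rate = c(1.1, 1.2, 1.05, 1.1, 1.0)) %>%
  mutate(pop = 50 * cumprod(lag(growth_rate, default = 1)))

# Reference: the same model written as an explicit loop.
pop_loop <- numeric(5)
pop_loop[1] <- 50
for (i in 2:5) {
  pop_loop[i] <- sim$growth_rate[i - 1] * pop_loop[i - 1]
}

print(isTRUE(all.equal(sim$pop, pop_loop)))
```

This only works when each step is a pure multiplication (or, with cumsum(), a pure addition); a model whose step depends nonlinearly on the previous state still needs a fold or a loop.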


Source: https://habr.com/ru/post/1258366/
