Split data frame into overlapping data frames

I am trying to write a function that behaves as follows, but it is very complicated:

DF <- data.frame(x = seq(1,10), y = rep(c('a','b','c','d','e'),2)) > DF xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e >OverLapSplit(DF,nsplits=2,overlap=2) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a [[2]] xy 1 5 a 2 6 b 3 7 c 4 8 d 5 9 e 6 10 a >OverLapSplit(DF,nsplits=1) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e >OverLapSplit(DF,nsplits=2,overlap=4) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a 7 7 b [[2]] xy 1 4 e 2 5 a 3 6 b 4 7 c 5 8 d 6 9 e 7 10 a >OverLapSplit(DF,nsplits=5,overlap=1) [[1]] xy 1 1 a 2 2 b 3 3 c [[2]] xy 1 3 c 2 4 d 3 5 e [[3]] xy 1 5 e 2 6 a 3 7 b [[4]] xy 1 7 b 2 8 c 3 9 d [[5]] xy 1 8 d 2 9 e 3 10 f 

I did not think much about what would happen if you try something like OverLapSplit(DF,nsplits=2,overlap=1)

The following are possible:

 [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e [[2]] xy 1 5 a 2 6 b 3 7 c 4 8 d 5 9 e 6 10 a 

Thanks!

+6
source share
3 answers

Try something like:

 OverlapSplit <- function(x,nsplit=1,overlap=2){ nrows <- NROW(x) nperdf <- ceiling( (nrows + overlap*nsplit) / (nsplit+1) ) start <- seq(1, nsplit*(nperdf-overlap)+1, by= nperdf-overlap ) if( start[nsplit+1] + nperdf != nrows ) warning("Returning an incomplete dataframe.") lapply(start, function(i) x[c(i:(i+nperdf-1)),]) } 

with nsplit the number of breaks! (nsplit = 1 returns 2 data frames). This will make the last data frame incomplete if the overlapping partitions really do not fit in the data frame and gives a warning.

 > OverlapSplit(DF,nsplit=3,overlap=2) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d [[2]] xy 3 3 c 4 4 d 5 5 e 6 6 a [[3]] xy 5 5 e 6 6 a 7 7 b 8 8 c [[4]] xy 7 7 b 8 8 c 9 9 d 10 10 e 

And one with a warning

 > OverlapSplit(DF,nsplit=1,overlap=1) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a [[2]] xy 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e NA NA <NA> Warning message: In OverlapSplit(DF, nsplit = 1, overlap = 1) : Returning an incomplete dataframe. 
+6
source

This uses the idea of ​​pebbles from Lattice graphics and therefore uses the code from the lattice package to generate the intervals, and then uses the loop to split the original DF into the correct subsets.

I was not exactly sure what is meant by overlap = 1 - I assume that you meant overlapping by 1 pattern / observation. If so, then the code below does this.

 OverlapSplit <- function(x, nsplits = 1, overlap = 0) { stopifnot(require(lattice)) N <- seq_len(nr <- nrow(x)) interv <- co.intervals(N, nsplits, overlap / nr) out <- vector(mode = "list", length = nrow(interv)) for(i in seq_along(out)) { out[[i]] <- x[interv[i,1] < N & N < interv[i,2], , drop = FALSE] } out } 

What gives:

 > OverlapSplit(DF, 2, 2) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a [[2]] xy 5 5 e 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e > OverlapSplit(DF) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e > OverlapSplit(DF, 4, 1) [[1]] xy 1 1 a 2 2 b 3 3 c [[2]] xy 3 3 c 4 4 d 5 5 e [[3]] xy 6 6 a 7 7 b 8 8 c [[4]] xy 8 8 c 9 9 d 10 10 e 
+4
source

Just to understand what I'm doing here:

 #Load Libraries library(PerformanceAnalytics) library(quantmod) #Function to Split Data Frame OverlapSplit <- function(x,nsplit=1,overlap=0){ nrows <- NROW(x) nperdf <- ceiling( (nrows + overlap*nsplit) / (nsplit+1) ) start <- seq(1, nsplit*(nperdf-overlap)+1, by= nperdf-overlap ) if( start[nsplit+1] + nperdf != nrows ) warning("Returning an incomplete dataframe.") lapply(start, function(i) x[c(i:(i+nperdf-1)),]) } #Function to run regression on 30 days to predict the next day FL <- as.formula(Next(HAM1)~HAM1+HAM2+HAM3+HAM4) MyRegression <- function(df,FL) { df <- as.data.frame(df) model <- lm(FL,data=df[1:30,]) predict(model,newdata=df[31,]) } #Function to roll the regression RollMyRegression <- function(data,ModelFUN,FL) { rollapply(data, width=31,FUN=ModelFUN,FL, by.column = FALSE, align = "right", na.pad = FALSE) } #Load Data data(managers) #Split Dataset split.data <- OverlapSplit(managers,2,30) sapply(split.data,dim) #Run rolling regression on each split output <- lapply(split.data,RollMyRegression,MyRegression,FL) output unlist(output) 

Thus, you can replace lapply at the end with a parallel version of lapply and increase the speed a bit.

Of course, now there is the problem of optimizing separation / overlap, given the number of processors and the size of your data set.

0
source

Source: https://habr.com/ru/post/885847/


All Articles