Split data frame into overlapping data frames

Question

Split data frame into overlapping data frames

I am trying to write a function that behaves as follows, but it is very complicated:

DF <- data.frame(x = seq(1,10), y = rep(c('a','b','c','d','e'),2)) > DF xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e >OverLapSplit(DF,nsplits=2,overlap=2) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a [[2]] xy 1 5 a 2 6 b 3 7 c 4 8 d 5 9 e 6 10 a >OverLapSplit(DF,nsplits=1) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e >OverLapSplit(DF,nsplits=2,overlap=4) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a 7 7 b [[2]] xy 1 4 e 2 5 a 3 6 b 4 7 c 5 8 d 6 9 e 7 10 a >OverLapSplit(DF,nsplits=5,overlap=1) [[1]] xy 1 1 a 2 2 b 3 3 c [[2]] xy 1 3 c 2 4 d 3 5 e [[3]] xy 1 5 e 2 6 a 3 7 b [[4]] xy 1 7 b 2 8 c 3 9 d [[5]] xy 1 8 d 2 9 e 3 10 f

I did not think much about what would happen if you try something like OverLapSplit(DF,nsplits=2,overlap=1)

The following are possible:

 [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e [[2]] xy 1 5 a 2 6 b 3 7 c 4 8 d 5 9 e 6 10 a

Thanks!

+6

r dataframe data-manipulation data-management

Zach Apr 13 '11 at 18:23

source share

3 answers

This uses the idea of pebbles from Lattice graphics and therefore uses the code from the lattice package to generate the intervals, and then uses the loop to split the original DF into the correct subsets.

I was not exactly sure what is meant by overlap = 1 - I assume that you meant overlapping by 1 pattern / observation. If so, then the code below does this.

 OverlapSplit <- function(x, nsplits = 1, overlap = 0) { stopifnot(require(lattice)) N <- seq_len(nr <- nrow(x)) interv <- co.intervals(N, nsplits, overlap / nr) out <- vector(mode = "list", length = nrow(interv)) for(i in seq_along(out)) { out[[i]] <- x[interv[i,1] < N & N < interv[i,2], , drop = FALSE] } out }

What gives:

 > OverlapSplit(DF, 2, 2) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a [[2]] xy 5 5 e 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e > OverlapSplit(DF) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e > OverlapSplit(DF, 4, 1) [[1]] xy 1 1 a 2 2 b 3 3 c [[2]] xy 3 3 c 4 4 d 5 5 e [[3]] xy 6 6 a 7 7 b 8 8 c [[4]] xy 8 8 c 9 9 d 10 10 e

+4

Gavin simpson Apr 13 '11 at 20:02

source share

Just to understand what I'm doing here:

 #Load Libraries library(PerformanceAnalytics) library(quantmod) #Function to Split Data Frame OverlapSplit <- function(x,nsplit=1,overlap=0){ nrows <- NROW(x) nperdf <- ceiling( (nrows + overlap*nsplit) / (nsplit+1) ) start <- seq(1, nsplit*(nperdf-overlap)+1, by= nperdf-overlap ) if( start[nsplit+1] + nperdf != nrows ) warning("Returning an incomplete dataframe.") lapply(start, function(i) x[c(i:(i+nperdf-1)),]) } #Function to run regression on 30 days to predict the next day FL <- as.formula(Next(HAM1)~HAM1+HAM2+HAM3+HAM4) MyRegression <- function(df,FL) { df <- as.data.frame(df) model <- lm(FL,data=df[1:30,]) predict(model,newdata=df[31,]) } #Function to roll the regression RollMyRegression <- function(data,ModelFUN,FL) { rollapply(data, width=31,FUN=ModelFUN,FL, by.column = FALSE, align = "right", na.pad = FALSE) } #Load Data data(managers) #Split Dataset split.data <- OverlapSplit(managers,2,30) sapply(split.data,dim) #Run rolling regression on each split output <- lapply(split.data,RollMyRegression,MyRegression,FL) output unlist(output)

Thus, you can replace lapply at the end with a parallel version of lapply and increase the speed a bit.

Of course, now there is the problem of optimizing separation / overlap, given the number of processors and the size of your data set.

0

Zach Apr 13 '11 at 21:07

source share

Joris meys · Accepted Answer · 2011-04-13T20:00:32+0000

Try something like:

 OverlapSplit <- function(x,nsplit=1,overlap=2){ nrows <- NROW(x) nperdf <- ceiling( (nrows + overlap*nsplit) / (nsplit+1) ) start <- seq(1, nsplit*(nperdf-overlap)+1, by= nperdf-overlap ) if( start[nsplit+1] + nperdf != nrows ) warning("Returning an incomplete dataframe.") lapply(start, function(i) x[c(i:(i+nperdf-1)),]) }

with nsplit the number of breaks! (nsplit = 1 returns 2 data frames). This will make the last data frame incomplete if the overlapping partitions really do not fit in the data frame and gives a warning.

 > OverlapSplit(DF,nsplit=3,overlap=2) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d [[2]] xy 3 3 c 4 4 d 5 5 e 6 6 a [[3]] xy 5 5 e 6 6 a 7 7 b 8 8 c [[4]] xy 7 7 b 8 8 c 9 9 d 10 10 e

And one with a warning

 > OverlapSplit(DF,nsplit=1,overlap=1) [[1]] xy 1 1 a 2 2 b 3 3 c 4 4 d 5 5 e 6 6 a [[2]] xy 6 6 a 7 7 b 8 8 c 9 9 d 10 10 e NA NA <NA> Warning message: In OverlapSplit(DF, nsplit = 1, overlap = 1) : Returning an incomplete dataframe.

Split data frame into overlapping data frames

More articles: