A faster way to count cases in 5-minute segments?

I have an events matrix that contains the times of occurrence of 5 million events. Each of these 5 million events has a "type" that ranges from 1 to 2000. A very simplified version of the matrix is shown below. The units for "times" are seconds since 1970 (standard Unix time). All events occurred after January 1, 2012.

    > events
      type      times
         1 1352861760
         1 1362377700
         2 1365491820
         2 1368216180
         2 1362088800
         2 1362377700

I am trying to split the time from 1/1/2012 into 5-minute buckets, and then fill each bucket with a count of how many events of each type occurred in it. My code is below. Note that types is a vector containing every possible type from 1 to 2000, and by is 300 because that is the number of seconds in 5 minutes.

    for(i in 1:length(types)){
      # keep only the events of the current type
      local <- events[events$type==types[i], c("type", "times")]
      # count how many of them fall into each 5-minute bucket
      assign(sprintf("a%d", i),
             table(cut(local$times,
                       breaks=seq(range(events$times)[1],
                                  range(events$times)[2], by=300))))
    }

This produces the variables a1 through a2000, where each ai is a row vector giving how many occurrences of type i fell into each of the 5-minute buckets.

Next, I find all pairwise correlations between a1 through a2000.
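For reference, a minimal sketch of that correlation step, assuming every ai was built with the same breaks (so they all have the same length); mget here just gathers the variables back into one object:

    # Gather a1..a2000 into one matrix (one column per type), then correlate
    A <- do.call(cbind, mget(paste0("a", seq_along(types))))
    cc <- cor(A)   # 2000 x 2000 matrix of pairwise correlations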

Is there a way to optimize the piece of code above? It runs very slowly, and I can't think of a way to make it faster; perhaps there are simply too many buckets because the time step is so small.

Any insight would be greatly appreciated.

Reproducible example:

    > head(events)
      type      times
        12 1308575460
        12 1308676680
        12 1308825420
        12 1309152660
        12 1309879140
        25 1309946460

    xevents <- xts(events[,"type"], .POSIXct(events[,"times"]))
    ep <- endpoints(xevents, "minutes", 5)
    counts <- period.apply(xevents, ep, tabulate, nbins=length(types))

    > head(counts)
                        1 2 3 4 5 6 7 8 9 10 11 12 13 14
    2011-06-20 09:11:00 0 0 0 0 0 0 0 0 0  0  0  1  0  0
    2011-06-21 13:18:00 0 0 0 0 0 0 0 0 0  0  0  1  0  0
    2011-06-23 06:37:00 0 0 0 0 0 0 0 0 0  0  0  1  0  0
    2011-06-27 01:31:00 0 0 0 0 0 0 0 0 0  0  0  1  0  0
    2011-07-05 11:19:00 0 0 0 0 0 0 0 0 0  0  0  1  0  0
    2011-07-06 06:01:00 0 0 0 0 0 0 0 0 0  0  0  0  0  0

    > ep[1:20]
     [1]  0  1  2  3  4  5  6  7  8  9 10 12 20 21 22 23 24 25 26 27

Above is the code I tried, but the problem is that the intervals do not advance in fixed 5-minute steps: the endpoints only advance when actual events occur.
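A quick way to see this, assuming the xevents and ep objects above: the gaps between consecutive endpoints track the event times rather than a fixed 300-second grid.

    # Seconds between consecutive endpoint rows; with sparse data these
    # follow the gaps between events instead of being a constant 300
    diff(.index(xevents)[ep[-1]])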

3 answers

I would use the xts package for this. Applying a function over non-overlapping 5-minute intervals is easy with the period.apply and endpoints functions.

    # create sample data
    library(xts)
    set.seed(21)
    N <- 1e6
    events <- cbind(sample(2000, N, replace=TRUE),
                    as.POSIXct("2012-01-01") + sample(1e7, N))
    colnames(events) <- c("type", "times")
    # create xts object
    xevents <- xts(events[,"type"], .POSIXct(events[,"times"]))
    # find the last row of each non-overlapping 5-minute interval
    ep <- endpoints(xevents, "minutes", 5)
    # count the number of occurrences of each "type"
    counts <- period.apply(xevents, ep, tabulate, nbins=2000)
    # set colnames
    colnames(counts) <- paste0("a", 1:ncol(counts))
    # calculate correlation
    # cc <- cor(counts)
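A possible follow-up on that last line, assuming the objects above: coredata strips the xts index, so cor sees a plain numeric matrix.

    dim(counts)                   # intervals x 2000 type columns
    cc <- cor(coredata(counts))   # pairwise correlations between the type columns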

Update to respond to comments / OP changes:

    # Create a sequence of 5-minute steps, from the actual start of the data
    m5 <- seq(round(start(xevents), 'mins'), end(xevents), by='5 mins')
    # Or: a sequence of 5-minute steps, from the start of 2012-01-01
    m5 <- seq(as.POSIXct("2012-01-01"), end(xevents), by='5 mins')
    # merge xevents with an empty 5-minute xts object, and
    # subtract 1 second, so endpoints fall at the end of each 5-minute interval
    xevents5 <- merge(xevents, xts(, m5-1))
    ep5 <- endpoints(xevents5, "minutes", 5)
    counts5 <- period.apply(xevents5, ep5, tabulate, nbins=2000)
    colnames(counts5) <- paste0("a", 1:ncol(counts5))
    # align to the beginning of each 5-minute interval, if you want
    counts5 <- align.time(counts5, 60*5)
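One way to sanity-check the result, assuming counts5 from above: once the empty 5-minute grid has been merged in and aligned, consecutive timestamps should be exactly 300 seconds apart.

    # Should collapse to a single value of 300 if the grid is regular
    table(diff(.index(counts5)))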

With 5 million records, I would probably use data.table. You can achieve this as follows:

    # First make a sequence of bucket boundaries, from the initial time
    # to at least the end time + 5 minutes
    buckets <- seq( from = min(df$times), by = 300, to = max(df$times) + 300 )

    require(data.table)
    DT <- data.table(df)

    # Work out which bucket each time falls in: the first boundary
    # that is >= the time (assigned by reference with :=)
    DT[ , bucket := which.max(times[1] <= buckets), by = times ]

    # Aggregate events by type and time bucket
    agg <- DT[ , list(Count = .N), by = list(type, bucket) ]
    agg
       type bucket Count
    1:    1      1     1
    2:    1  31721     1
    3:    2  42102     1
    4:    2  51183     1
    5:    2  30758     1
    6:    2  31721     1
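If the end goal is the same bucket-by-type count matrix for correlations, a minimal sketch of one way to get there (assuming a data.table recent enough to provide dcast.data.table; agg is the aggregate built above):

    # Reshape the long counts to wide: one row per bucket, one column per type
    wide <- dcast(agg, bucket ~ type, value.var = "Count", fill = 0)
    # Drop the bucket key and correlate the type columns
    cc <- cor(as.matrix(wide[, !"bucket"]))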

Keep the cut within the range of times, the way you did it, but do it once for the whole data set. After that, you can use table or xtabs on the entire data set to generate the matrix in one pass. Something like the following:

    # round the range down to a multiple of 300 seconds
    r <- trunc(range(events$times) / 300) * 300
    events$times.bin <- cut(events$times, seq(r[1], r[2] + 300, by=300))
    xtabs(~ type + times.bin, events, drop.unused.levels=TRUE)

Decide whether you want drop.unused.levels or not. With data like this, you can also create a sparse matrix.
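As a sketch of that sparse option: xtabs has a sparse argument that returns a sparse matrix from the Matrix package.

    library(Matrix)  # provides the sparse matrix class xtabs returns
    sm <- xtabs(~ type + times.bin, events, sparse = TRUE)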


Source: https://habr.com/ru/post/1493309/

