Next Post Index

I have a sample dataset of the path of one bike. My goal is to find out, on average, the time that elapses between visits to station B.

So far, I have been able to sort the dataset with:

test[order(test$starttime, decreasing = FALSE),] 

and find the row indices where start_station or end_station is B:

 which(test$start_station == 'B')
 which(test$end_station == 'B')

This is where I ran into a problem. To calculate how long the bike is away from station B, we need to take difftime() between each record where start_station = "B" (the bike leaves) and the next record where end_station = "B" (the bike returns), even if both are on the same row (see row 6).

Using the data set below, we know that the bike spent 510 minutes away from station B between 07:30:00 and 16:00:00, 30 minutes between 18:00:00 and 18:30:00, and 210 minutes between 19:00:00 and 22:30:00, which averages 250 minutes.

How can I reproduce this result in R using difftime()?

 > test
    bikeid start_station           starttime end_station             endtime
 1       1             A 2017-09-25 01:00:00           B 2017-09-25 01:30:00
 2       1             B 2017-09-25 07:30:00           C 2017-09-25 08:00:00
 3       1             C 2017-09-25 10:00:00           A 2017-09-25 10:30:00
 4       1             A 2017-09-25 13:00:00           C 2017-09-25 13:30:00
 5       1             C 2017-09-25 15:30:00           B 2017-09-25 16:00:00
 6       1             B 2017-09-25 18:00:00           B 2017-09-25 18:30:00
 7       1             B 2017-09-25 19:00:00           A 2017-09-25 19:30:00
 8       1               2017-09-25 20:00:00           C 2017-09-25 20:30:00
 9       1             C 2017-09-25 22:00:00           B 2017-09-25 22:30:00
 10      1             B 2017-09-25 23:00:00           C 2017-09-25 23:30:00

Here is an example of data:

 > dput(test)
 structure(list(bikeid = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1),
     start_station = c("A", "B", "C", "A", "C", "B", "B", "", "C", "B"),
     starttime = structure(c(1506315600, 1506339000, 1506348000,
     1506358800, 1506367800, 1506376800, 1506380400, 1506384000,
     1506391200, 1506394800), class = c("POSIXct", "POSIXt"), tzone = ""),
     end_station = c("B", "C", "A", "C", "B", "B", "A", "C", "B", "C"),
     endtime = structure(c(1506317400, 1506340800, 1506349800,
     1506360600, 1506369600, 1506378600, 1506382200, 1506385800,
     1506393000, 1506396600), class = c("POSIXct", "POSIXt"), tzone = "")),
     .Names = c("bikeid", "start_station", "starttime", "end_station", "endtime"),
     row.names = c(NA, -10L), class = "data.frame")
2 answers

This calculates the differences in order of occurrence, but does not add them to the data.frame (df1 here is the data frame called test in the question):

 lapply(df1$starttime[df1$start_station == "B"],
        function(x, et) difftime(et[x < et][1], x, units = "mins"),
        et = df1$endtime[df1$end_station == "B"])
 [[1]]
 Time difference of 510 mins

 [[2]]
 Time difference of 30 mins

 [[3]]
 Time difference of 210 mins

 [[4]]
 Time difference of NA mins

To calculate the average time:

 v1 <- sapply(df1$starttime[df1$start_station == "B"],
              function(x, et) difftime(et[x < et][1], x, units = "mins"),
              et = df1$endtime[df1$end_station == "B"])
 mean(v1, na.rm = TRUE)
 [1] 250
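
To see why et[x < et][1] picks the right return, here is a minimal, self-contained sketch. It copies only the relevant timestamps from the question's table by hand; the names starts_B, ends_B and away, and the fixed UTC time zone, are assumptions made for illustration and are not part of the answer above. It also assumes the arrival times are already sorted in ascending order.

```r
# Departures from B and arrivals at B, copied from the question's table.
# tz = "UTC" is an assumption, used only to keep the example reproducible.
starts_B <- as.POSIXct(c("2017-09-25 07:30:00", "2017-09-25 18:00:00",
                         "2017-09-25 19:00:00", "2017-09-25 23:00:00"),
                       tz = "UTC")
ends_B   <- as.POSIXct(c("2017-09-25 01:30:00", "2017-09-25 16:00:00",
                         "2017-09-25 18:30:00", "2017-09-25 22:30:00"),
                       tz = "UTC")

# For a single departure: keep only the arrivals that happen later,
# then take the first of them (the next return to B).
x <- starts_B[1]
next_arrival <- ends_B[x < ends_B][1]
difftime(next_arrival, x, units = "mins")

# The same step for every departure. The last departure has no later
# arrival, so indexing with [1] gives NA and the duration is NA too.
away <- sapply(starts_B, function(x) {
  as.numeric(difftime(ends_B[x < ends_B][1], x, units = "mins"))
})
away
mean(away, na.rm = TRUE)
```

The trailing NA is why na.rm = TRUE is needed when averaging.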

Another possibility:

 library(data.table)
 d <- setDT(test)[ , {
   start = starttime[start_station == "B"]
   end = endtime[end_station == "B"]
   .(start = start, end = end, duration = difftime(end, start, units = "min"))
 }, by = .(trip = cumsum(start_station == "B"))]

 d
 #    trip               start                 end duration
 # 1:    0                <NA> 2017-09-25 01:30:00  NA mins
 # 2:    1 2017-09-25 07:30:00 2017-09-25 16:00:00 510 mins
 # 3:    2 2017-09-25 18:00:00 2017-09-25 18:30:00  30 mins
 # 4:    3 2017-09-25 19:00:00 2017-09-25 22:30:00 210 mins
 # 5:    4 2017-09-25 23:00:00                <NA>  NA mins

 d[ , mean(duration, na.rm = TRUE)]
 # Time difference of 250 mins

 # or
 d[ , mean(as.integer(duration), na.rm = TRUE)]
 # [1] 250

The data is grouped by a counter that is incremented by 1 each time the bike starts a trip from station B ( by = cumsum(start_station == "B") ), so each group runs from one departure from B up to the row before the next departure.
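
As a small illustration of that counter, the sketch below applies cumsum() to the start_station column from the question on its own. The variable name trip mirrors the grouping column above; nothing here needs data.table.

```r
# start_station column copied from the question's dput() output.
start_station <- c("A", "B", "C", "A", "C", "B", "B", "", "C", "B")

# TRUE on every row where the bike leaves B; cumsum() turns the
# running count of those rows into a trip id.
trip <- cumsum(start_station == "B")
trip
# [1] 0 1 1 1 1 2 3 3 3 4

# Row numbers that fall into each trip id.
split(seq_along(start_station), trip)
```

Trip 0 holds the row before the first departure from B, which is why it has no start time, and trip 4 has a departure but no return yet, which is why both show NA durations.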


Source: https://habr.com/ru/post/1272103/

