I have a data frame with missing values ββfor "SNAP_ID". I would like to fill in the missing values ββwith floating point values ββbased on the sequence from the previous missing value (lag ()?). I would really like to achieve this using only dplyr, if possible.
Assumptions:
- There will never be data missing as the first or last row. I generate missing dates based on missing days between the minimum and maximum dataset.
- There may be several spaces in the data set.
Current data:
end SNAP_ID 1 2015-06-26 12:59:00 365 2 2015-06-26 13:59:00 366 3 2015-06-27 00:01:00 NA 4 2015-06-27 23:00:00 NA 5 2015-06-28 00:01:00 NA 6 2015-06-28 23:00:00 NA 7 2015-06-29 09:00:00 367 8 2015-06-29 09:59:00 368
What I want to achieve:
end SNAP_ID 1 2015-06-26 12:59:00 365.0 2 2015-06-26 13:59:00 366.0 3 2015-06-27 00:01:00 366.1 4 2015-06-27 23:00:00 366.2 5 2015-06-28 00:01:00 366.3 6 2015-06-28 23:00:00 366.4 7 2015-06-29 09:00:00 367.0 8 2015-06-29 09:59:00 368.0
As a data frame:
df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260, 1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct", "POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end", "SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")
It was my attempt to achieve this goal, but it only works for the first missing value:
df %>% arrange(end) %>% mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID)) end SNAP_ID 1 2015-06-26 12:59:00 365.0 2 2015-06-26 13:59:00 366.0 3 2015-06-27 00:01:00 366.1 4 2015-06-27 23:00:00 NA 5 2015-06-28 00:01:00 NA 6 2015-06-28 23:00:00 NA 7 2015-06-29 09:00:00 367.0 8 2015-06-29 09:59:00 368.0
The outstanding answer from @ Mathematical.coffee is below:
df %>% arrange(end) %>% group_by(tmp=cumsum(!is.na(SNAP_ID))) %>% mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>% ungroup() %>% select(-tmp)