Fill in the remaining sequence values with dplyr

Question

Fill in the remaining sequence values with dplyr

I have a data frame with missing values for "SNAP_ID". I would like to fill in the missing values with floating point values based on the sequence from the previous missing value (lag ()?). I would really like to achieve this using only dplyr, if possible.

Assumptions:

There will never be data missing as the first or last row. I generate missing dates based on missing days between the minimum and maximum dataset.
There may be several spaces in the data set.

Current data:

end SNAP_ID 1 2015-06-26 12:59:00 365 2 2015-06-26 13:59:00 366 3 2015-06-27 00:01:00 NA 4 2015-06-27 23:00:00 NA 5 2015-06-28 00:01:00 NA 6 2015-06-28 23:00:00 NA 7 2015-06-29 09:00:00 367 8 2015-06-29 09:59:00 368

What I want to achieve:

  end SNAP_ID 1 2015-06-26 12:59:00 365.0 2 2015-06-26 13:59:00 366.0 3 2015-06-27 00:01:00 366.1 4 2015-06-27 23:00:00 366.2 5 2015-06-28 00:01:00 366.3 6 2015-06-28 23:00:00 366.4 7 2015-06-29 09:00:00 367.0 8 2015-06-29 09:59:00 368.0

As a data frame:

 df <- structure(list(end = structure(c(1435323540, 1435327140, 1435363260, 1435446000, 1435449660, 1435532400, 1435568400, 1435571940), tzone = "UTC", class = c("POSIXct", "POSIXt")), SNAP_ID = c(365, 366, NA, NA, NA, NA, 367, 368)), .Names = c("end", "SNAP_ID"), row.names = c(NA, -8L), class = "data.frame")

It was my attempt to achieve this goal, but it only works for the first missing value:

 df %>% arrange(end) %>% mutate(SNAP_ID=ifelse(is.na(SNAP_ID),lag(SNAP_ID)+0.1,SNAP_ID)) end SNAP_ID 1 2015-06-26 12:59:00 365.0 2 2015-06-26 13:59:00 366.0 3 2015-06-27 00:01:00 366.1 4 2015-06-27 23:00:00 NA 5 2015-06-28 00:01:00 NA 6 2015-06-28 23:00:00 NA 7 2015-06-29 09:00:00 367.0 8 2015-06-29 09:59:00 368.0

The outstanding answer from @ Mathematical.coffee is below:

 df %>% arrange(end) %>% group_by(tmp=cumsum(!is.na(SNAP_ID))) %>% mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) %>% ungroup() %>% select(-tmp)

+4

r dplyr

Tyler muth Jul 16 '15 at 10:08

source share

1 answer

mathematical.coffee · Accepted Answer · 2015-07-16T22:55:36+0000

EDIT: The new version works for any number of NA runs. This does not require zoo .

First, note that tmp=cumsum(!is.na(SNAP_ID)) groups SNAP_ID , such groups from the same tmp consist of a single value other than NA, followed by the execution of NA values.

Then group this variable and just add .1 to the first SNAP_ID to populate NA:

 df %>% arrange(end) %>% group_by(tmp=cumsum(!is.na(SNAP_ID))) %>% mutate(SNAP_ID=SNAP_ID[1] + 0.1*(0:(length(SNAP_ID)-1))) end SNAP_ID tmp 1 2015-06-26 12:59:00 365.0 1 2 2015-06-26 13:59:00 366.0 2 3 2015-06-27 00:01:00 366.1 2 4 2015-06-27 23:00:00 366.2 2 5 2015-06-28 00:01:00 366.3 2 6 2015-06-28 23:00:00 366.4 2 7 2015-06-29 09:00:00 367.0 3 8 2015-06-29 09:59:00 368.0 4

Then you can leave the tmp column after that (add %>% select(-tmp) to the end).

EDIT: This is an old version that does not work for subsequent NA s runs.

If your goal is to fill each NA with the previous value + 0.1, you can use zoo na.locf (which fills each NA previous value) along with cumsum(is.na(SNAP_ID))*0.1 to add extra 0.1.

 library(zoo) df %>% arrange(end) %>% mutate(SNAP_ID=ifelse(is.na(SNAP_ID), na.locf(SNAP_ID) + cumsum(is.na(SNAP_ID))*0.1, SNAP_ID))

Fill in the remaining sequence values ​​with dplyr

More articles:

Fill in the remaining sequence values with dplyr