How can I fill in NA values based on the next valid value, but split that value across the preceding NAs?

Please note: this is a simplified explanation of where the data comes from; the origin of the data is not relevant to the coding problem.

I have a dataset produced by collecting water in a tube every day. I can't walk out and measure the tube every day (but the tube keeps filling), so there are gaps in the water records. In this dummy dataset that happens on day 5 and on days 7 to 9. For the dummy data I assumed that 500 ml of water flows into the tube every day (the real dataset is much messier!).

Dummy data:

day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
value<-c(500,500,500,500,NA,1000,NA,NA,NA,2000,500,500)
df<-data.frame(day,value)

Data explanation: I collected every day on days 1 to 4, so the value for each of those days is 500 ml. I missed day 5, so its value is NA; I collected on day 6, so that value is 1000 ml (the water from days 5 and 6 combined). I missed days 7, 8 and 9, so those values are NA; I collected on day 10, which gives 2000 ml (the water from 4 days). After that I collected every day for the last two days.

I would like to fill the NA gaps by taking the value of the next “real” measurement and dividing it equally between the NA days and the day of the measurement itself. Yes, I am assuming that while I wasn't measuring, the process continued at a steady rate, so the next measurement can be divided equally between those days.
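To make the intended rule concrete, the arithmetic for the gap on days 7 to 9 (measured on day 10) can be sketched as follows. The variable names are only for illustration.

```r
# Days 7-9 were missed; the reading on day 10 covers all four days
gap_and_reading <- c(NA, NA, NA, 2000)

n_days <- length(gap_and_reading)            # 3 missed days + the measured day
share  <- gap_and_reading[n_days] / n_days   # 2000 / 4

rep(share, n_days)                           # 500 500 500 500
```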

This is what the output should look like:

day<-c(1,2,3,4,5,6,7,8,9,10,11,12)
corrected.value<-c(500,500,500,500,500,500,500,500,500,500,500,500)
corrected.df<-data.frame(day,corrected.value)

Note: I can't simply replace the NAs with 500 using "value[is.na(value)] <- 500", because the real values vary (for example 457.6, 779, 376, and so on).


Answer:

# Create test Data: 
# note that this is slightly different from your input
# but in this way you can better verify that it works as expected
day<-c(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15)
value<-c(NA,500,500,500,NA,3000,NA,NA,NA,5000,500,500,NA,NA,NA)
df<-data.frame(day,value)


# "Cleansing" starts here :
RLE <- rle(is.na(df$value))

# we cannot do anything if last values are NAs, we'll just keep them in the data.frame
if(tail(RLE$values,1)){
  RLE$lengths <- head(RLE$lengths,-1)
  RLE$values <- head(RLE$values,-1)
}

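# index of the first non-NA value following each run of NAs
afterNA <- cumsum(RLE$lengths)[RLE$values] + 1
# index of the first NA in each run
firstNA <- (cumsum(RLE$lengths) - RLE$lengths + 1)[RLE$values]
# number of days sharing the measurement: the NA run plus the measured day
occurrences <- afterNA - firstNA + 1
replacements <- df$value[afterNA] / occurrences

# overwrite each NA run and the value that follows it with the equal share
df$value[unlist(Map(f=seq.int,firstNA,afterNA))] <- rep.int(replacements,occurrences)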

Result:

> df
   day value
1    1   250
2    2   250
3    3   500
4    4   500
5    5  1500
6    6  1500
7    7  1250
8    8  1250
9    9  1250
10  10  1250
11  11   500
12  12   500
13  13    NA
14  14    NA
15  15    NA
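
For comparison, the same result can probably be reached without explicit index arithmetic by grouping each run of NAs with the measurement that follows it and averaging within the group. This is only a sketch of an alternative, not a replacement for the rle() code above; the function name fill_by_group and the helper variable grp are made up for illustration.

```r
# Hypothetical helper: spread each measurement equally over the NA days before it
fill_by_group <- function(x) {
  # Counting non-NA values from the end gives every NA run the same
  # group id as the next non-NA value; trailing NAs end up in group 0.
  grp <- rev(cumsum(rev(!is.na(x))))
  ave(x, grp, FUN = function(v) {
    if (all(is.na(v))) return(v)   # no later measurement: leave the NAs as they are
    rep(sum(v, na.rm = TRUE) / length(v), length(v))
  })
}

# Data from the question: every day should come out as 500
value <- c(500,500,500,500,NA,1000,NA,NA,NA,2000,500,500)
fill_by_group(value)
```

Applied to the test vector used in the answer, this should give the same values as the output shown above, including the trailing NAs.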

Source: https://habr.com/ru/post/1660978/

