The fastest way to fill in missing dates for data.table

Question

I load a data.table from a CSV file with fields for date, orders, amount, etc.

The input file sometimes does not have data for all dates. For example, as shown below:

> NADayWiseOrders
           date orders  amount guests
  1: 2013-01-01     50 2272.55    149
  2: 2013-01-02      3   64.04      4
  3: 2013-01-04      1   18.81      0
  4: 2013-01-05      2   77.62      0
  5: 2013-01-07      2   35.82      2

In the above, there are no records for 03 Jan and 06 Jan.

I would like to fill in the missing entries either with default values (for example, zero for orders, amount, etc.) or by carrying the last observation forward (for example, 03 Jan would reuse the values from 02 Jan, and 06 Jan would reuse the values from 05 Jan).

What is the best / fastest way to fill such gaps in the dates with default values?

The answer here suggests using allow.cartesian = TRUE and expand.grid for missing weekdays. That may work for weekdays (since there are only 7 of them), but I am not sure it is the right approach for dates, especially when dealing with multi-year data.

+4

datetime r data.table

Gopalakrishna palem Apr 9 '14 at 8:25

3 answers

The idiomatic data.table way (using rolling joins) is as follows:

setkey(NADayWiseOrders, date)
all_dates <- seq(from = as.Date("2013-01-01"), 
                   to = as.Date("2013-01-07"), 
                   by = "days")

NADayWiseOrders[J(all_dates), roll=Inf]
         date orders  amount guests
1: 2013-01-01     50 2272.55    149
2: 2013-01-02      3   64.04      4
3: 2013-01-03      3   64.04      4
4: 2013-01-04      1   18.81      0
5: 2013-01-05      2   77.62      0
6: 2013-01-06      2   77.62      0
7: 2013-01-07      2   35.82      2
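
For the other option in the question (zero defaults instead of carrying values forward), the same keyed join without roll should leave NA rows for the missing dates, which can then be replaced. The following is only a sketch: it rebuilds the sample table from the question, and it assumes a data.table version that provides setnafill() (1.12.4 or later) and that the value columns are numeric.

```r
library(data.table)

# Sample data copied from the question.
NADayWiseOrders <- data.table(
  date   = as.Date(c("2013-01-01", "2013-01-02", "2013-01-04",
                     "2013-01-05", "2013-01-07")),
  orders = c(50, 3, 1, 2, 2),
  amount = c(2272.55, 64.04, 18.81, 77.62, 35.82),
  guests = c(149, 4, 0, 0, 2)
)
setkey(NADayWiseOrders, date)

all_dates <- seq(min(NADayWiseOrders$date), max(NADayWiseOrders$date),
                 by = "day")

# Plain keyed join (no roll): dates missing from the table come back as NA rows.
filled <- NADayWiseOrders[J(all_dates)]

# Replace the NAs with 0 by reference, in the value columns only.
setnafill(filled, type = "const", fill = 0,
          cols = c("orders", "amount", "guests"))
```

After this, filled should have one row per calendar day, with zeros on 03 Jan and 06 Jan.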

+7

Arun Apr 9 '14 at 11:50

Here's how you fill in the gaps within a subgroup:

# a toy dataset with gaps in the time series
dt <- as.data.table(read.csv(textConnection('"group","date","x"
"a","2017-01-01",1
"a","2017-02-01",2
"a","2017-05-01",3
"b","2017-02-01",4
"b","2017-04-01",5')))
dt[,date := as.Date(date)]

# the desired dates by group
indx <- dt[,.(date=seq(min(date),max(date),"months")),group]

# key the tables and join them using a rolling join
setkey(dt,group,date)
setkey(indx,group,date)
dt[indx,roll=TRUE]

#>    group       date x
#> 1:     a 2017-01-01 1
#> 2:     a 2017-02-01 2
#> 3:     a 2017-03-01 2
#> 4:     a 2017-04-01 2
#> 5:     a 2017-05-01 3
#> 6:     b 2017-02-01 4
#> 7:     b 2017-03-01 4
#> 8:     b 2017-04-01 5
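
If you would rather not key the tables, the same grouped rolling join can probably be written with an ad hoc on= join. This is a sketch under the assumption of a data.table version that supports on= together with roll (1.9.6 or later); the toy data is rebuilt directly instead of going through read.csv, and the name filled is just for illustration.

```r
library(data.table)

# Same toy dataset as above, built directly.
dt <- data.table(
  group = c("a", "a", "a", "b", "b"),
  date  = as.Date(c("2017-01-01", "2017-02-01", "2017-05-01",
                    "2017-02-01", "2017-04-01")),
  x     = c(1, 2, 3, 4, 5)
)

# The desired monthly dates per group.
indx <- dt[, .(date = seq(min(date), max(date), by = "month")), by = group]

# Ad hoc rolling join: roll applies to the last column in on= (date),
# the earlier column (group) is matched exactly.
filled <- dt[indx, on = .(group, date), roll = TRUE]
```

No setkey() call is needed here, so the original row order of dt is left untouched.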

+1

Jthorpe Mar 05 '18 at 23:55

shadow · Accepted Answer · 2014-04-09T09:00:27+0000

Not sure if this is the fastest, but it will work if there are no NAs in the data:

# just in case these aren't Dates. 
NADayWiseOrders$date <- as.Date(NADayWiseOrders$date)
# all desired dates.
alldates <- data.table(date=seq.Date(min(NADayWiseOrders$date), max(NADayWiseOrders$date), by="day"))
# merge
dt <- merge(NADayWiseOrders, alldates, by="date", all=TRUE)
# now carry forward last observation (alternatively, set NA to 0)
require(xts)  # na.locf() comes from zoo, which xts loads
na.locf(dt)
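
If you want to avoid the extra package, the carry-forward step can likely be done with data.table alone. A minimal sketch, assuming data.table 1.12.4 or later (for setnafill()) and numeric value columns; the sample table is rebuilt from the question, and the same caveat about pre-existing NAs applies.

```r
library(data.table)

# Sample data copied from the question.
NADayWiseOrders <- data.table(
  date   = as.Date(c("2013-01-01", "2013-01-02", "2013-01-04",
                     "2013-01-05", "2013-01-07")),
  orders = c(50, 3, 1, 2, 2),
  amount = c(2272.55, 64.04, 18.81, 77.62, 35.82),
  guests = c(149, 4, 0, 0, 2)
)

alldates <- data.table(date = seq(min(NADayWiseOrders$date),
                                  max(NADayWiseOrders$date), by = "day"))

# Outer merge adds the missing dates as NA rows, sorted by date.
dt <- merge(NADayWiseOrders, alldates, by = "date", all = TRUE)

# Carry the last observation forward in place, value columns only.
setnafill(dt, type = "locf", cols = c("orders", "amount", "guests"))
```

Using type = "const" with fill = 0 instead of type = "locf" would give the zero-default variant.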
