Python Pandas: Detecting Time Series Frequency

Suppose I load time series data from SQL or CSV (not created in Python). The index would be:

DatetimeIndex(['2015-03-02 00:00:00', '2015-03-02 01:00:00',
               '2015-03-02 02:00:00', '2015-03-02 03:00:00',
               '2015-03-02 04:00:00', '2015-03-02 05:00:00',
               '2015-03-02 06:00:00', '2015-03-02 07:00:00',
               '2015-03-02 08:00:00', '2015-03-02 09:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', name=u'hour', length=3360, freq=None, tz=None)

As you can see, freq is None. I am wondering how I can detect the frequency of this series and set freq to that frequency.

If possible, I would like this to work if the data is not continuous (there are many gaps in the series).

I tried taking all the differences between consecutive timestamps, but I'm not sure how to convert the result into a format that a Series can read.

3 answers

Perhaps take the differences between consecutive timestamps in the index and use the mode (or the smallest difference) as the frequency.

import pandas as pd
import numpy as np

# simulate some data
# ===================================
np.random.seed(0)
dt_rng = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_idx = pd.DatetimeIndex(np.random.choice(dt_rng, size=2000, replace=False))
df = pd.DataFrame(np.random.randn(2000), index=dt_idx, columns=['col']).sort_index()
df

                        col
2015-03-02 01:00:00  2.0261
2015-03-02 04:00:00  1.3325
2015-03-02 05:00:00 -0.9867
2015-03-02 06:00:00 -0.0671
2015-03-02 08:00:00 -1.1131
2015-03-02 09:00:00  0.0494
2015-03-02 10:00:00 -0.8130
2015-03-02 11:00:00  1.8453
...                     ...
2015-07-19 13:00:00 -0.4228
2015-07-19 14:00:00  1.1962
2015-07-19 15:00:00  1.1430
2015-07-19 16:00:00 -1.0080
2015-07-19 18:00:00  0.4009
2015-07-19 19:00:00 -1.8434
2015-07-19 20:00:00  0.5049
2015-07-19 23:00:00 -0.5349

[2000 rows x 1 columns]

# processing
# ==================================
# the gap distribution
res = (pd.Series(df.index[1:]) - pd.Series(df.index[:-1])).value_counts()
res

01:00:00    1181
02:00:00     499
03:00:00     180
04:00:00      93
05:00:00      24
06:00:00      10
07:00:00       9
08:00:00       3
dtype: int64

# the mode can be considered as frequency
res.index[0]  # output: Timedelta('0 days 01:00:00')
# or maybe the smallest difference
res.index.min()  # output: Timedelta('0 days 01:00:00')




# get full datetime rng
full_rng = pd.date_range(df.index[0], df.index[-1], freq=res.index[0])
full_rng

DatetimeIndex(['2015-03-02 01:00:00', '2015-03-02 02:00:00',
               '2015-03-02 03:00:00', '2015-03-02 04:00:00',
               '2015-03-02 05:00:00', '2015-03-02 06:00:00',
               '2015-03-02 07:00:00', '2015-03-02 08:00:00',
               '2015-03-02 09:00:00', '2015-03-02 10:00:00', 
               ...
               '2015-07-19 14:00:00', '2015-07-19 15:00:00',
               '2015-07-19 16:00:00', '2015-07-19 17:00:00',
               '2015-07-19 18:00:00', '2015-07-19 19:00:00',
               '2015-07-19 20:00:00', '2015-07-19 21:00:00',
               '2015-07-19 22:00:00', '2015-07-19 23:00:00'],
              dtype='datetime64[ns]', length=3359, freq='H', tz=None)
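The question also asks how to set the detected frequency on the data. A minimal sketch of one way to do that, assuming it is acceptable to reindex onto the full range so that the gaps become NaN rows (the variable names here are only illustrative, and the step is passed as a Timedelta to avoid depending on a particular frequency alias):

```python
import numpy as np
import pandas as pd

# Simulated hourly data with gaps, in the spirit of the example above.
rng = np.random.default_rng(0)
full = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00',
                     freq=pd.Timedelta(hours=1))
kept = np.sort(rng.choice(len(full), size=2000, replace=False))
df = pd.DataFrame({'col': rng.standard_normal(2000)}, index=full[kept])

# Most common gap between consecutive timestamps (the mode).
step = df.index.to_series().diff().mode().iloc[0]

# Reindex onto a regular grid; missing hours become NaN rows,
# and the new index carries an explicit freq.
full_rng = pd.date_range(df.index[0], df.index[-1], freq=step)
regular = df.reindex(full_rng)

print(step)
print(regular.index.freq)
```

Whether the NaN rows should then be filled (for example with ffill or interpolate) depends on the data; that choice is not covered here.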

If the data is continuous, you can use the pandas.DatetimeIndex.inferred_freq property:

dt_ix = pd.date_range('2015-03-02 00:00:00', '2015-07-19 23:00:00', freq='H')
dt_ix = pd.DatetimeIndex(dt_ix.values)  # rebuild the index so that freq is None, as with loaded data
dt_ix.inferred_freq
Out[2]: 'H'

or the pandas.infer_freq function:

pd.infer_freq(dt_ix)
Out[3]: 'H'

If the data is not continuous, pandas.infer_freq returns None. In that case, as already suggested, you can use pandas.Series.diff:

split_ix = dt_ix.drop(pd.date_range('2015-05-01 00:00:00','2015-05-30 00:00:00', freq='1H'))
split_ix.to_series().diff().min()
Out[4]: Timedelta('0 days 01:00:00')
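The two approaches can be combined. The following is only a sketch: guess_offset is a hypothetical helper name, not part of pandas, and falling back to the most common gap (rather than the smallest) is an assumption that suits data where most neighbouring samples are one step apart.

```python
import pandas as pd
from pandas.tseries.frequencies import to_offset


def guess_offset(index):
    """Hypothetical helper: best-effort sampling offset of a DatetimeIndex."""
    index = index.sort_values()
    if len(index) < 2:
        raise ValueError("need at least two timestamps")
    # pd.infer_freq needs at least 3 timestamps and returns None when gaps are irregular.
    inferred = pd.infer_freq(index) if len(index) >= 3 else None
    if inferred is not None:
        return to_offset(inferred)
    # Fallback: the most common difference between consecutive timestamps.
    step = index.to_series().diff().mode().iloc[0]
    return to_offset(step)


full = pd.date_range('2015-03-02', periods=48, freq=pd.Timedelta(hours=1))
gappy = full.delete([5, 6, 20])  # knock out a few hours

print(pd.infer_freq(gappy))      # None, because the gaps are irregular
print(guess_offset(gappy))
```

The returned offset can be passed as freq to pd.date_range or to asfreq.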

The minimum time difference can be found with

np.diff(df.index.values).min()

which is usually in ns units. To get the frequency in samples per second (Hz), assuming ns:

freq = 1e9 / np.diff(df.index.values).min().astype('int64')
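If you would rather not depend on the values being in ns, one option is to divide by an explicit one-second timedelta. This is a small sketch with a made-up three-point index; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

idx = pd.DatetimeIndex(['2015-03-02 00:00', '2015-03-02 01:00', '2015-03-02 04:00'])

min_gap = np.diff(idx.values).min()          # numpy.timedelta64, smallest gap
seconds = min_gap / np.timedelta64(1, 's')   # float seconds, independent of the stored unit
hz = 1.0 / seconds                           # samples per second
step = pd.Timedelta(min_gap)                 # usable as freq= in pd.date_range

print(seconds, hz, step)
```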

Source: https://habr.com/ru/post/1598694/