Time series analysis - unevenly spaced measurements - pandas + statsmodels

I have two numpy arrays, light_points and time_points, and would like to apply some time series analysis methods to this data.

So I tried this:

    import statsmodels.api as sm
    import pandas as pd

    tdf = pd.DataFrame({'time': time_points[:]})
    rdf = pd.DataFrame({'light': light_points[:]})
    rdf.index = pd.DatetimeIndex(freq='w', start=0, periods=len(rdf.light))
    #rdf.index = pd.DatetimeIndex(tdf['time'])

This runs, but does not do the right thing: the measurements are not evenly spaced in time. And if I simply declare the time_points pandas DataFrame as the index of my frame, I get an error:

    rdf.index = pd.DatetimeIndex(tdf['time'])
    decomp = sm.tsa.seasonal_decompose(rdf)

      elif freq is None:
        raise ValueError("You must specify a freq or x must be a pandas object with a timeseries index")
    ValueError: You must specify a freq or x must be a pandas object with a timeseries index

I do not know how to work around this. In addition, pandas' TimeSeries seems to be deprecated.

I tried this:

    rdf = pd.Series({'light': light_points[:]})
    rdf.index = pd.DatetimeIndex(tdf['time'])

But this gives me a length mismatch:

 ValueError: Length mismatch: Expected axis has 1 elements, new values have 122 elements 

However, I do not understand where this comes from, since rdf['light'] and tdf['time'] are the same length...

In the end, I tried defining my rdf as a pandas Series:

    rdf = pd.Series(light_points[:], index=pd.DatetimeIndex(time_points[:]))

And I get this:

 ValueError: You must specify a freq or x must be a pandas object with a timeseries index 

Then I tried replacing the index with

  pd.TimeSeries(time_points[:]) 

And this gives me an error on the seasonal_decompose call:

 AttributeError: 'Float64Index' object has no attribute 'inferred_freq' 

How can I work with unevenly spaced data? I was thinking of creating an approximately evenly spaced time array by adding many unknown values between the existing ones and using interpolation to "evaluate" these points, but I think there might be a cleaner and simpler solution.
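Something like this rough sketch is what I had in mind (assuming time_points holds datetime-like values; the daily grid 'D' and time-based interpolation are just placeholders):

    import pandas as pd

    # Sketch of the interpolation idea: put the measurements on a regular
    # grid, leaving NaN where nothing was measured, then interpolate.
    s = pd.Series(light_points, index=pd.to_datetime(time_points))
    regular = s.resample('D').mean()              # regular daily grid, NaN in the gaps
    regular = regular.interpolate(method='time')  # estimate the missing points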

1 answer

seasonal_decompose() requires a freq that is either provided as part of the DateTimeIndex meta-information, can be inferred via pandas.Index.inferred_freq, or else given by the user as an integer that specifies the number of periods per cycle, e.g. 12 for monthly (from the docstring for seasonal_mean):

    def seasonal_decompose(x, model="additive", filt=None, freq=None):
        """
        Parameters
        ----------
        x : array-like
            Time series
        model : str {"additive", "multiplicative"}
            Type of seasonal component. Abbreviations are accepted.
        filt : array-like
            The filter coefficients for filtering out the seasonal component.
            The default is a symmetric moving average.
        freq : int, optional
            Frequency of the series. Must be used if x is not a pandas
            object with a timeseries index.

To illustrate - using random sample data:

    import numpy as np
    import pandas as pd
    import statsmodels.api as sm
    from datetime import datetime

    length = 400
    x = np.sin(np.arange(length)) * 10 + np.random.randn(length)
    df = pd.DataFrame(data=x,
                      index=pd.date_range(start=datetime(2015, 1, 1),
                                          periods=length, freq='w'),
                      columns=['value'])

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 400 entries, 2015-01-04 to 2022-08-28
    Freq: W-SUN

    decomp = sm.tsa.seasonal_decompose(df)
    data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
    data.columns = ['series', 'trend', 'seasonal', 'resid']

    Data columns (total 4 columns):
    series      400 non-null float64
    trend       348 non-null float64
    seasonal    400 non-null float64
    resid       348 non-null float64
    dtypes: float64(4)
    memory usage: 15.6 KB

So far so good - now randomly dropping elements from the DateTimeIndex to create unevenly spaced data:

    df = df.iloc[np.unique(np.random.randint(low=0, high=length, size=int(length * .8)))]

    <class 'pandas.core.frame.DataFrame'>
    DatetimeIndex: 222 entries, 2015-01-11 to 2022-08-21
    Data columns (total 1 columns):
    value    222 non-null float64
    dtypes: float64(1)
    memory usage: 3.5 KB

    df.index.freq
    None
    df.index.inferred_freq
    None

Running seasonal_decompose on this data "works":

    decomp = sm.tsa.seasonal_decompose(df, freq=52)
    data = pd.concat([df, decomp.trend, decomp.seasonal, decomp.resid], axis=1)
    data.columns = ['series', 'trend', 'seasonal', 'resid']

    DatetimeIndex: 224 entries, 2015-01-04 to 2022-08-07
    Data columns (total 4 columns):
    series      224 non-null float64
    trend       172 non-null float64
    seasonal    224 non-null float64
    resid       172 non-null float64
    dtypes: float64(4)
    memory usage: 8.8 KB

The question is how useful the result is. Even without data gaps that complicate the inference of seasonal patterns (see the .interpolate() example in the release notes), statsmodels qualifies this procedure as follows:

    Notes
    -----
    This is a naive decomposition. More sophisticated methods should
    be preferred.

    The additive model is Y[t] = T[t] + S[t] + e[t]

    The multiplicative model is Y[t] = T[t] * S[t] * e[t]

    The seasonal component is first removed by applying a convolution
    filter to the data. The average of this smoothed series for each
    period is the returned seasonal component.
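That said, the interpolation route from the question is workable. A minimal sketch, continuing with the weekly sample data above (the choice of interpolation method is an assumption): reindex to a complete weekly grid so the index carries a frequency again, fill the gaps, then decompose without passing freq:

    # Sketch: rebuild a regular weekly grid, interpolate the gaps,
    # then decompose - the reindexed DatetimeIndex now carries freq='W'.
    full_index = pd.date_range(start=df.index.min(), end=df.index.max(), freq='W')
    df_filled = df.reindex(full_index).interpolate(method='time')
    decomp = sm.tsa.seasonal_decompose(df_filled)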
