How to use k-means for time series data that contain NaNs?

I have several time series that partially overlap and do not necessarily share the same start and end dates. Each row represents a different time series. I padded them with NaN to the same length so that every value stays aligned with its actual collection time.

For example, at t (1,2,3,4,5,6):

Station 1: nan, nan, 2, 4, 5, 10
Station 2: nan, 1, 4, nan, 10, 8
Station 3: 1, 9, 4, 7, nan, nan

I am trying to run a cluster analysis in Python to group stations with similar behavior. The timing of the observations matters, so I can't just drop the NaNs (as far as I know).

Any ideas?

1 answer

K-means is not the best algorithm for this kind of data.

K-means is designed to minimize the variance within each cluster (= within-cluster sum of squares, WCSS).

But how do you compute a variance when some values are NaN? And how meaningful is variance here in the first place?
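To make the problem concrete, here is a minimal sketch that uses NumPy and the three stations from the question. It treats all stations as one cluster (an arbitrary choice, purely for illustration) and evaluates the sum-of-squares objective directly. Any time step with a missing value poisons both the centroid and the objective:

```python
import numpy as np

# The three stations from the question, aligned on t = 1..6.
stations = np.array([
    [np.nan, np.nan, 2, 4, 5, 10],
    [np.nan, 1, 4, np.nan, 10, 8],
    [1, 9, 4, 7, np.nan, np.nan],
])

# Illustration only: put all stations in a single cluster.
centroid = stations.mean(axis=0)           # NaN wherever any station is missing
wcss = ((stations - centroid) ** 2).sum()  # NaN propagates into the objective

print(centroid)
print(wcss)
```

Only t = 3 is observed at every station, so it is the only time step with a finite centroid value, and the overall objective is NaN.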

Instead, you could use:

  • a similarity measure designed for time series, such as DTW, threshold-based distances, etc.
  • a distance-based clustering algorithm. If you only have a few series, hierarchical clustering should work fine.

Source: https://habr.com/ru/post/1500636/

