Plotting large time series

Summary of the question:

Are there any easy-to-use algorithms to reduce the number of points needed to represent a time series, without changing how it appears in the plot?

Motivating problem:

I am trying to interactively visualize 10 to 15 data channels recorded from an embedded system at ~20 kHz. Logs can span more than an hour, which means I am dealing with 1e8 to 1e9 points. Furthermore, I care about potentially small anomalies that persist for very short periods of time (i.e., less than 1 ms), so simple decimation is not an option.

Not surprisingly, most plotting libraries struggle if you do the naive thing and hand them data arrays that exceed dedicated GPU memory. It is actually a bit worse than that on my system; using a random float vector as a test case, I only get about 5e7 points out of both Matlab's and Python + matplotlib's plotting functions before the refresh rate drops below 1 FPS.

Existing questions and solutions:

This problem is somewhat similar to a number of existing questions, such as:

but it deals with larger datasets and/or is more stringent about accuracy at the cost of interactivity (it would be great to get silky-smooth 60 FPS pan and zoom, but in reality I would be happy with 1 FPS).

Clearly, some form of data reduction is needed. While searching for existing tools that solve this problem, I found two paradigms:

  • Decimate, but track outliers: A good example of this is Matlab + dsplot (i.e., the tool suggested in the accepted answer to the first question I linked above). dsplot decimates to a fixed number of evenly spaced points, but then adds back outliers identified using the standard deviation of a high-pass FIR filter's output. While this is probably a viable solution for several classes of data, it can run into trouble if there is significant frequency content above the filter's cutoff frequency, and it may require tuning.

  • Plot min and max: With this approach, you divide the time series into intervals corresponding to each horizontal pixel and draw only the minimum and maximum value in each interval. Matlab + Plot (Big) is a good example of this, but it uses an O(n) min/max calculation, which makes it slow by the time you get to 1e8 or 1e9 points. A binary search tree in a mex function or in Python would solve this, but is difficult to implement.
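For reference, the min/max paradigm sketched above takes only a few lines in NumPy. This is a minimal illustration (the function name and bin layout are mine, not taken from Plot (Big)); it is the straightforward O(n) variant, so a tree-based index would still be needed to make repeated pans and zooms over 1e9 points fast:

```python
import numpy as np

def minmax_decimate(y, n_bins):
    """Keep only the min and max of each of n_bins equal-width intervals
    (one interval per horizontal pixel), so at most 2*n_bins points survive.
    Extremes are preserved no matter how narrow a spike is."""
    n = len(y)
    if n <= 2 * n_bins:
        return np.arange(n)
    # Interval edges; integer arithmetic spreads the remainder evenly.
    edges = (np.arange(n_bins + 1) * n) // n_bins
    keep = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        seg = y[lo:hi]
        keep.append(lo + np.argmin(seg))   # keeps downward spikes
        keep.append(lo + np.argmax(seg))   # keeps upward spikes
    return np.unique(keep)                 # sorted, duplicates dropped

# A 1-sample spike in a 1e6-point signal survives reduction to ~2000 points.
y = np.zeros(1_000_000)
y[123_456] = 5.0
idx = minmax_decimate(y, 1000)
assert 123_456 in idx and len(idx) <= 2000
```

Plotting `y[idx]` against `x[idx]` then looks identical to the full-resolution plot at that pixel width, since no pixel column can show more than its interval's extremes anyway.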

Are there any simpler solutions that do what I want?

Edit (2018-02-18): Question reorganized to focus on algorithms instead of tools that implement algorithms.

1 answer

I had the very same problem displaying pressure time series from hundreds of sensors, with samples every minute over several years. In some cases (e.g., when cleaning the data) I wanted to see all the outliers; in others I was more interested in the trend. So I wrote a function that can reduce the number of data points using two methods: Visvalingam and Douglas-Peucker. The first tends to remove outliers, while the second keeps them. I optimized the function to work on large datasets; I wrote it after realizing that all the plotting methods I tried failed with that many points, and the ones that did cope decimated the data in ways I could not control. The function is as follows:

function [X, Y, indices, relevance] = lineSimplificationI(X,Y,N,method,option)
%lineSimplification Reduce the number of points of the line described by X
%and Y to N, preserving the most relevant ones.
% Uses adapted versions of the Visvalingam and Douglas-Peucker algorithms.
% The number of points of the line is reduced iteratively until reaching
% N non-NaN points. Repeated NaN points in the original data are deleted,
% but non-repeated NaNs are preserved to keep line breaks.
% The two available methods are:
%
% Visvalingam: The relevance of a point is proportional to the area of
%   the triangle defined by the point and its two neighbors.
%
% Douglas-Peucker: The relevance of a point is proportional to the
%   distance between it and the straight line defined by its two neighbors.
%   Note that the implementation here is iterative, NOT recursive as in
%   the original algorithm. This allows better handling of large data sets.
%
% DIFFERENCES: Visvalingam tends to remove outliers while Douglas-Peucker
% keeps them.
%
% INPUTS:
%   X: X coordinates of the line points
%   Y: Y coordinates of the line points
%   method: Either 'Visvalingam' or 'DouglasPeucker' (default)
%   option: Either 'silent' (default) or 'verbose' if additional outputs
%       of the calculations are desired.
%
% OUTPUTS:
%   X: X coordinates of the simplified line points
%   Y: Y coordinates of the simplified line points
%   indices: Indices to the positions of the points preserved in the
%       original X and Y. Therefore the output X is equal to the input
%       X(indices).
%   relevance: Relevance of the returned points. It can be used to further
%       simplify the line dynamically by keeping only points with
%       higher relevance. But this will produce bigger distortions of
%       the line shape than calling lineSimplification again with a
%       smaller value for N, as removing a point changes the relevance
%       of its neighbors.
%
% Implementation by Camilo Rada - camilo@rada.cl
%
    if nargin < 3
        error('Line points positions X, Y and target point count N MUST be specified');
    end
    if nargin < 4
        method='DouglasPeucker';
    end
    if nargin < 5
        option='silent';
    end
    doDisplay=strcmp(option,'verbose');

    X=double(X(:));
    Y=double(Y(:));
    indices=1:length(Y);

    if length(X)~=length(Y)
        error('Vectors X and Y MUST have the same number of elements');
    end

    if N>=length(Y)
        relevance=ones(length(Y),1);
        if doDisplay
            disp('N is greater than or equal to the number of points in the line. Original X,Y were returned. Relevances were not computed.')
        end
        return
    end

    % Removing repeated NaN from Y
    % We find all the NaNs with another NaN to the left
    repeatedNaNs= isnan(Y(2:end)) & isnan(Y(1:end-1));
    % We also consider the first element a repeated NaN if it is NaN
    repeatedNaNs=[isnan(Y(1)); repeatedNaNs(:)];
    Y=Y(~repeatedNaNs);
    X=X(~repeatedNaNs);
    indices=indices(~repeatedNaNs);

    % Removing trailing NaN if any
    if isnan(Y(end))
        Y=Y(1:end-1);
        X=X(1:end-1);
        indices=indices(1:end-1);
    end

    pCount=length(X);
    if doDisplay
        disp(['Initial point count = ' num2str(pCount)])
        disp(['Non repeated NaN count in data = ' num2str(sum(isnan(Y)))])
    end

    iterCount=0;
    while pCount>N
        iterCount=iterCount+1;
        % If the vertices of a triangle are at the points (x1,y1), (x2,y2) and
        % (x3,y3), the area of such a triangle is
        % area = abs((x1*(y2-y3)+x2*(y3-y1)+x3*(y1-y2))/2)
        % so the areas of the triangles defined by each point of X,Y and its
        % two neighbors are
        twiceTriangleArea = abs((X(1:end-2).*(Y(2:end-1)-Y(3:end))+X(2:end-1).*(Y(3:end)-Y(1:end-2))+X(3:end).*(Y(1:end-2)-Y(2:end-1))));

        switch method
            case 'Visvalingam'
                % In this case the relevance is given by the area of the
                % triangle formed by each point and its two neighbors
                relevance=twiceTriangleArea/2;
            case 'DouglasPeucker'
                % In this case the relevance is given by the minimum distance
                % from the point to the line formed by its two neighbors
                neighborDistances=ppDistance([X(1:end-2) Y(1:end-2)],[X(3:end) Y(3:end)]);
                relevance=twiceTriangleArea./neighborDistances;
            otherwise
                error(['Unknown method: ' method]);
        end
        relevance=[Inf; relevance; Inf];

        % We remove the pCount-N least relevant points as long as they are not contiguous
        [srelevance, sortorder]= sort(relevance,'descend');
        firstFinite=find(isfinite(srelevance),1,'first');
        startPos=uint32(firstFinite+N+1);
        toRemove=sort(sortorder(startPos:end));
        if isempty(toRemove)
            break;
        end

        % Now we have to deal with contiguous elements, as removing one will
        % change the relevance of its neighbors. Therefore we have to
        % identify pairs of contiguous points and only remove the one with
        % lesser relevance.

        % Contiguous will be true for an element if the next or the previous
        % element is also flagged for removal
        contiguousToKeep=[diff(toRemove(:))==1; false] | [false; (toRemove(1:end-1)-toRemove(2:end))==-1];
        notContiguous=~contiguousToKeep;
        % And the relevances associated to the elements flagged for removal
        contRel=relevance(toRemove);

        % Now we rearrange contiguous so it is sorted in two rows; therefore
        % if both rows are true in a given column, we have a case of two
        % contiguous points that are both flagged for removal.
        % This process depends on the rearrangement, as contiguous
        % elements can end up in different columns, so it has to be done
        % twice to make sure no contiguous elements are removed.
        nContiguous=length(contiguousToKeep);

        for paddingMode=1:2
            % The rearrangement is only possible if we have an even number of
            % elements, so we add one dummy element at the end if needed
            if paddingMode==1
                if mod(nContiguous,2)
                    pcontiguous=[contiguousToKeep; false];
                    pcontRel=[contRel; -Inf];
                else
                    pcontiguous=contiguousToKeep;
                    pcontRel=contRel;
                end
            else
                if mod(nContiguous,2)
                    pcontiguous=[false; contiguousToKeep];
                    pcontRel=[-Inf; contRel];
                else
                    pcontiguous=[false; contiguousToKeep(1:end-1)];
                    pcontRel=[-Inf; contRel(1:end-1)];
                end
            end

            contiguousPairs=reshape(pcontiguous,2,[]);
            pcontRel=reshape(pcontRel,2,[]);

            % Finding columns with contiguous elements
            contCols=all(contiguousPairs);
            if ~any(contCols) && paddingMode==2
                break;
            end
            % Finding the row of the least relevant element of each column
            [~, lesserElementRow]=max(pcontRel);

            % The index in contiguous of the first element of each pair is
            if paddingMode==1
                firstElementIdx=((1:size(contiguousPairs,2))*2)-1;
            else
                firstElementIdx=((1:size(contiguousPairs,2))*2)-2;
            end

            % and the index in contiguous of the most relevant element of each
            % pair is
            lesserElementIdx=firstElementIdx+lesserElementRow-1;

            % Now we set the least relevant element as NOT contiguous, so it
            % is removed
            contiguousToKeep(lesserElementIdx(contCols))=false;
        end

        % And now we delete the relevant contiguous points from the toRemove
        % list
        toRemove=toRemove(contiguousToKeep | notContiguous);

        if any(diff(toRemove(:))==1) && doDisplay
            warning([num2str(sum(diff(toRemove(:))==1)) ' contiguous elements removed in one iteration.'])
        end

        toRemoveLogical=false(pCount,1);
        toRemoveLogical(toRemove)=true;

        X=X(~toRemoveLogical);
        Y=Y(~toRemoveLogical);
        indices=indices(~toRemoveLogical);

        pCount=length(X);
        nRemoved=sum(toRemoveLogical);
        if doDisplay
            disp(['Iteration ' num2str(iterCount) ', Point count = ' num2str(pCount) ' (' num2str(nRemoved) ' removed)'])
        end
        if nRemoved==0
            break;
        end
    end
end

function d = ppDistance(p1,p2)
    d=sqrt((p1(:,1)-p2(:,1)).^2+(p1(:,2)-p2(:,2)).^2);
end
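For readers working outside Matlab, here is a minimal Python sketch of the Visvalingam criterion the function above adapts. It only illustrates the triangle-area test, and is not a port of the implementation (no NaN handling, no batch removal of points); removing one point per pass makes it O(n^2), so it is unusable at 1e8 points as written:

```python
import numpy as np

def visvalingam(x, y, n_keep):
    """Repeatedly drop the interior point whose triangle (formed with its
    two neighbors) has the smallest area, until n_keep points remain.
    One point is removed per pass for clarity, so this is O(n^2); a heap
    over the triangle areas brings it down to O(n log n)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = list(range(len(x)))          # indices of surviving points
    while len(keep) > n_keep:
        xs, ys = x[keep], y[keep]
        # Twice the triangle area at each interior surviving point.
        areas = np.abs(xs[:-2] * (ys[1:-1] - ys[2:])
                       + xs[1:-1] * (ys[2:] - ys[:-2])
                       + xs[2:] * (ys[:-2] - ys[1:-1]))
        # areas[i] belongs to interior point keep[i+1]; endpoints stay.
        del keep[int(np.argmin(areas)) + 1]
    return np.array(keep)

# The low-area points are dropped first; the spike at index 2 survives.
x = np.arange(6, dtype=float)
y = np.array([0.0, 0.1, 5.0, 0.1, 0.0, 0.1])
idx = visvalingam(x, y, 4)             # -> [0, 2, 3, 5]
```

A Douglas-Peucker-style relevance would instead divide each triangle area by the distance between the point's two neighbors, as in the 'DouglasPeucker' branch of the function above.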

Source: https://habr.com/ru/post/1275414/

