Finding the k-means elbow when the elbow plot is a smooth curve

I am trying to find the elbow value of k using the code below:

    load CSDmat %mydata
    for k = 2:20
        opts = statset('MaxIter', 500, 'Display', 'off');
        [IDX1,C1,sumd1,D1] = kmeans(CSDmat,k,'Replicates',5,'options',opts,'distance','correlation'); % kmeans matlab
        [yy,ii] = min(D1');   %% assign points to nearest center

        distort = 0;
        distort_across = 0;
        clear clusts;
        for nn=1:k
            I = find(ii==nn);   %% indices of points in cluster nn
            J = find(ii~=nn);   %% indices of points not in cluster nn
            clusts{nn} = I;     %% save into clusts cell array
            if (length(I)>0)
                mu(nn,:) = mean(CSDmat(I,:));   %% update mean
                %% Compute within class distortion
                muB = repmat(mu(nn,:),length(I),1);
                distort = distort+sum(sum((CSDmat(I,:)-muB).^2));
                %% Compute across class distortion
                muB = repmat(mu(nn,:),length(J),1);
                distort_across = distort_across + sum(sum((CSDmat(J,:)-muB).^2));
            end
        end
        %% Set distortion as the ratio between the within
        %% class scatter and the across class scatter
        distort = distort/(distort_across+eps);

        bestD(k)=distort;
        bestC=clusts;
    end
    figure; plot(bestD);

bestD values (within-cluster dispersion / between-cluster dispersion):

 [ 0.401970132754914 0.193697163350293 0.119427184084282 0.0872681777446508 0.0687948264457301 0.0566215549396577 0.0481117619129058 0.0420491551659459 0.0361696583755145 0.0320384092689509 0.0288948343304147 0.0262373245283877 0.0239462330460614 0.0218350896369853 0.0201506779033703 0.0186757121130685 0.0176258625858971 0.0163239661159014 0.0154933431470081 ] 

Adapted from Lihi Zelnik-Manor, March 2005, Caltech.

The plot of the ratio of within-cluster dispersion to between-cluster dispersion is a smooth curve whose knee is itself smooth (see the bestD data above). How can I find the knee for such a plot?

1 answer

I think it is better to use only the within-class distortion as the optimization criterion:

    %% Compute within class distortion
    muB = repmat(mu(nn,:),length(I),1);
    distort = distort+sum(sum((CSDmat(I,:)-muB).^2));

Use this value as it is, without dividing it by distort_across. Then compute its "derivative", that is, the difference between successive values of k:

    unexplained_error = within_class_distortion;
    derivative = diff(unexplained_error);
    plot(derivative)
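As a rough sketch of where within_class_distortion might come from: the third output of kmeans (sumd) already holds the per-cluster sums of point-to-centroid distances, so summing it gives one value per k. The data X, the range kRange, and the default squared Euclidean distance are assumptions made for illustration; the question uses CSDmat with the 'correlation' distance, in which case sumd is measured in that distance instead.

```matlab
% Hypothetical data: three well-separated 2-D blobs (stand-in for CSDmat)
rng(0);
X = [randn(50,2); randn(50,2) + 6; randn(50,2) - 6];

kRange = 2:8;                                   % same starting k as in the question
within_class_distortion = zeros(size(kRange));
for j = 1:numel(kRange)
    % sumd: per-cluster sum of point-to-centroid distances
    % (squared Euclidean by default)
    [~, ~, sumd] = kmeans(X, kRange(j), 'Replicates', 5);
    within_class_distortion(j) = sum(sumd);
end

unexplained_error = within_class_distortion;
% derivative(j) is the change in error going from kRange(j) to kRange(j+1)
derivative = diff(unexplained_error);
```

Because the error normally shrinks as k grows, the entries of derivative are expected to be negative; their magnitudes are the reductions.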

derivative(k) tells you how much the unexplained error was reduced by adding a new cluster. I suggest you stop adding clusters once the reduction in this error is less than one tenth of the first reduction you obtained.

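    % diff() of a decreasing error is negative, so compare magnitudes
    for i = 1:length(derivative)
        if abs(derivative(i)) < abs(derivative(1))/10
            break
        end
    end
    k_opt = i+1;   % assumes unexplained_error(1) is the value for k = 2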
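The same stopping rule can be sketched without a loop using find. The numbers below are the first six bestD values from the question, used only as sample input (they are the ratio, not the pure within-class distortion recommended here), and the mapping of the first entry to k = 2 is an assumption. Magnitudes are compared because diff of a decreasing sequence is negative.

```matlab
% Sample input: first six bestD values from the question
% (assumption: the first entry corresponds to k = 2)
unexplained_error = [0.401970132754914 0.193697163350293 0.119427184084282 ...
                     0.0872681777446508 0.0687948264457301 0.0566215549396577];

reduction = abs(diff(unexplained_error));            % error reduction per added cluster
i = find(reduction < reduction(1)/10, 1, 'first');   % first reduction below a tenth of the first
if isempty(i)
    i = numel(reduction);                            % threshold never reached: use the last index
end
k_opt = i + 1;
```

For this input the first reduction is about 0.208, and the fourth (about 0.018) is the first one below a tenth of it, so i = 4 and k_opt = 5. The factor of 10 is a tunable choice, not a fixed rule.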

Actually, the method of obtaining the optimal number of clusters is application dependent, but I think you can get a good k value using this suggestion.


Source: https://habr.com/ru/post/917658/

