Agglomerative clustering in Matlab

I have a simple two-dimensional data set that I want to cluster agglomeratively (without knowing the optimal number of clusters in advance). The only way I have been able to cluster my data is by passing the "maxclust" option to the function.

For simplicity, let's say this is my dataset:

X=[ 1,1; 1,2; 2,2; 2,1; 5,4; 5,5; 6,5; 6,4 ]; 

Naturally, I would like this data to form 2 clusters. I understand that if I knew this, I could just say:

 T = clusterdata(X,'maxclust',2); 

and to find which points fall into each cluster, I could say:

 cluster_1 = X(T==1, :); 

and

 cluster_2 = X(T==2, :); 

But if I don't know in advance that 2 clusters are optimal for this data set, how do I cluster the data?

thanks

+6
3 answers

The whole point of this method is that it represents the clusters as a hierarchy, and you decide how much detail you want to see.

[Figure: agglomerative dendrogram]

Think of it as a horizontal line crossing a dendrogram that moves from 0 (each point is its own cluster) up to its maximum value (all points in one cluster). You could:

  • stop when you reach the specified number of clusters
  • manually place it at a specific height value
  • choose a place where the clusters are too far from each other according to the distance criterion (i.e. a big jump to the next level)

This can be done using either the 'maxclust' or the 'cutoff' option of the CLUSTER / CLUSTERDATA functions.
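As a rough sketch, the three options above might look like this for the data set from the question. It assumes single linkage with Euclidean distance; the cutoff height of 2.5 and the variable names (T_max, T_cut, T_jump, heights, c) are illustrative choices, not something the question specifies.

```matlab
X = [1,1; 1,2; 2,2; 2,1; 5,4; 5,5; 6,5; 6,4];

% Build the hierarchy once (assumed: single linkage, Euclidean distance).
Z = linkage(pdist(X), 'single');

% 1) Stop at a specified number of clusters.
T_max = cluster(Z, 'maxclust', 2);

% 2) Cut at a specific height (2.5 is an arbitrary illustrative value).
T_cut = cluster(Z, 'cutoff', 2.5, 'criterion', 'distance');

% 3) Cut inside the largest jump between consecutive merge heights.
heights = Z(:, 3);                     % third column holds the merge heights
[~, k] = max(diff(heights));           % position of the biggest jump
c = (heights(k) + heights(k+1)) / 2;   % midpoint of that jump
T_jump = cluster(Z, 'cutoff', c, 'criterion', 'distance');
numClusters = numel(unique(T_jump));
```

Calling dendrogram(Z) draws the tree, which makes it easier to see where such a horizontal cut falls.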

+5

To select the optimal number of clusters, one common approach is to make a plot similar to a scree plot. You then look for the "elbow" in the plot, and the number of clusters at that point is the one you select. As the criterion here we will use the within-cluster sum of squares:

 function wss = plotScree(X, n)
     % Within-cluster sum of squares for 1..n clusters
     wss = zeros(1, n);
     % One cluster: total sum of squares of the whole data set
     wss(1) = (size(X, 1)-1) * sum(var(X, [], 1));
     for i=2:n
         T = clusterdata(X,'maxclust',i);
         % (group size - 1) * per-group variance, summed over columns and groups
         wss(i) = sum((grpstats(T, T, 'numel')-1) .* sum(grpstats(X, T, 'var'), 2));
     end
     hold on
     plot(wss)
     plot(wss, '.')
     xlabel('Number of clusters')
     ylabel('Within-cluster sum-of-squares')

 >> plotScree(X, 5)
 ans =
    54.0000    4.0000    3.3333    2.5000    2.0000

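If you would rather not read the elbow off the plot by eye, one possible heuristic (an assumption on my part, not part of the approach above) is to pick the point where the curve bends most sharply, i.e. where the second difference of the values is largest. A minimal sketch using the values printed above:

```matlab
% Within-cluster sums of squares for 1..5 clusters, copied from the output above.
wss = [54.0000 4.0000 3.3333 2.5000 2.0000];

% Second difference: how much the decrease slows down at each interior point.
bend = diff(wss, 2);

% bend(j) describes the bend at wss(j+1), so shift the index by one.
[~, j] = max(bend);
k = j + 1;   % suggested number of clusters
```

For this data set the sharpest bend is at 2 clusters. On noisier data the elbow may be ambiguous, so it is worth checking the plot as well.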

+5

You can use the NbClust package in R, which uses 30 indexes to determine the optimal number of clusters in a dataset.

-1

Source: https://habr.com/ru/post/900824/

