Choosing a suitable similarity metric and evaluating the reliability of a k-means clustering model

I implemented k-means clustering to find clusters in 300 objects. Each of my objects has about 30 dimensions. Distance is calculated using the Euclidean metric.

I need to know

  • How do I determine whether my algorithm works correctly? I can't make a plot that would give some idea of the correctness of my algorithm.
  • Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?
+4
3 answers

The two questions in the OP are separate topics (i.e., there is no overlap in their answers), so I will try to answer them one at a time, beginning with item 1 on the list.

How do I determine whether my [clustering] algorithm works correctly?

k-means, like other unsupervised ML techniques, lacks a good selection of diagnostic tests to answer questions such as "are the cluster assignments returned by k-means more meaningful for k = 3 or k = 5?"

However, there is one widely used test that gives intuitive results and is easy to apply. This diagnostic metric is simply the ratio:

inter-centroid separation / intra-cluster variance

As the value of this ratio increases, the quality of your clustering result improves.

This is intuitive. The first of these metrics asks: how far apart are the clusters from each other (measured between the cluster centers)?

But inter-centroid separation alone does not tell the whole story, because two clustering runs can return results with the same inter-centroid separation even though one of them is clearly better because its clusters are "tighter" (i.e., have smaller radii); in other words, there is more edge-to-edge separation between the clusters. That is what the second metric accounts for: intra-cluster variance. This is simply the average variance, calculated per cluster.

In general, the ratio of inter-centroid separation to intra-cluster variance is a quick, consistent, and reliable way to compare results from different clustering algorithms, or results of the same algorithm run with different parameters, e.g., number of iterations, choice of distance metric, or number of centroids (the value of k).

The desired result is dense (small) clusters, each of which is far from the others.

The calculation is simple (a code sketch follows the steps below):

For intercentroid separation:

  • calculate the pairwise distances between the cluster centers; then

  • calculate the median of these distances.

For intracluster dispersion:

  • for each cluster, calculate the distance of every data point in that cluster from its cluster center; then

  • (for each cluster) calculate the variance of the sequence of distances from the above step; then

  • average these variance values.
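In code, the whole recipe fits in a few lines. Below is a minimal sketch in Python/NumPy; the function name clustering_quality and the argument layout (a points array, integer labels, a centroid array) are my own assumptions, not part of the original answer:

    import numpy as np

    def clustering_quality(points, labels, centers):
        # Ratio of inter-centroid separation to intra-cluster variance.
        # Higher is better: tight clusters that sit far apart.
        k = len(centers)
        # inter-centroid separation: median of pairwise distances between centers
        pairwise = [np.linalg.norm(centers[i] - centers[j])
                    for i in range(k) for j in range(i + 1, k)]
        separation = np.median(pairwise)
        # intra-cluster variance: average, over clusters, of the variance of
        # point-to-centroid distances within each cluster
        variances = [np.var(np.linalg.norm(points[labels == c] - centers[c], axis=1))
                     for c in range(k)]
        dispersion = np.mean(variances)
        return separation / dispersion

Run it on the output of two different fits (say k = 3 versus k = 5, or Euclidean versus Manhattan) and prefer the result with the larger ratio.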


This is my answer to the first question. Here's the second question:

Is Euclidean distance the correct method for calculating distances? What if I have 100 dimensions instead of 30?

First, the easier question: does Euclidean distance remain a valid metric as the number of dimensions/features increases?

Euclidean distance scales perfectly well; it works for two dimensions or two thousand. For any pair of data points:

  • subtract their feature vectors element-wise,

  • square each element of the resulting vector,

  • sum those squares,

  • take the square root of this scalar.

Nowhere in that sequence of calculations does scale (the number of dimensions) come into play.
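To make that concrete, here is the same four-step calculation written out once and applied to 30-dimensional and 100-dimensional points; a throwaway sketch, nothing here beyond plain NumPy:

    import numpy as np

    def euclidean(x, y):
        diff = x - y            # subtract the feature vectors element-wise
        squared = diff ** 2     # square each element of the result
        total = squared.sum()   # sum those squares
        return np.sqrt(total)   # square root of that scalar

    rng = np.random.default_rng(0)
    print(euclidean(rng.random(30), rng.random(30)))     # 30 dimensions
    print(euclidean(rng.random(100), rng.random(100)))   # 100 dimensions

The function body is identical in both calls; only the length of the vectors changes.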

But whether Euclidean distance is a suitable similarity metric for your problem depends on your data. For example, is it purely numeric (continuous)? Or does it include discrete (categorical) variables (e.g., gender: M/F)? If one of your dimensions is "current location" and, of 200 users, 100 have the value "San Francisco" and the other 100 have "Boston", you cannot really say that on average your users are somewhere in Kansas, but that is effectively what Euclidean distance would tell you.
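As a toy illustration of that last point (the coordinates below are rough and only for illustration): treat each user's location as an approximate latitude/longitude pair and take the mean, which is exactly what a k-means centroid does.

    import numpy as np

    san_francisco = np.array([37.8, -122.4])   # approximate lat/long
    boston        = np.array([42.4,  -71.1])   # approximate lat/long
    users = np.vstack([np.tile(san_francisco, (100, 1)),
                       np.tile(boston, (100, 1))])
    print(users.mean(axis=0))   # ~[40.1, -96.8], roughly the Kansas/Nebraska area

No user is anywhere near that point, which is the sense in which Euclidean averaging of categorical values is misleading.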

In any case, since we don't know anything about your data, I'll just give you a simple flowchart that you can apply to your data to determine a suitable similarity metric.

To determine an appropriate similarity metric given your data:

[flowchart image: decision tree for choosing a similarity metric based on the type of data]

+12
  • Euclidean distance is fine if the dimensions are comparable and on the same scale. If one dimension represents length and another the weight of the item, plain Euclidean distance should be replaced by a weighted (scaled) one.

  • Reduce the data to 2D and plot it; this is a good option for a visual check, if it works. Or use some sanity check, for example find the cluster centers and verify that all elements of a cluster are not too far from them (a sketch of both ideas follows this list).
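If it helps, here is one way to act on both bullets above, assuming scikit-learn and matplotlib are available; the random X is only a stand-in for the real 300 x 30 data:

    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X = np.random.rand(300, 30)                    # stand-in for the real data
    X_scaled = StandardScaler().fit_transform(X)   # put every dimension on a common scale

    km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(X_scaled)

    # 2D projection for a visual check, coloured by cluster label
    X_2d = PCA(n_components=2).fit_transform(X_scaled)
    plt.scatter(X_2d[:, 0], X_2d[:, 1], c=km.labels_, s=10)
    plt.title("k-means clusters projected to 2D with PCA")
    plt.show()

    # sanity check: distance of every point to its own cluster center
    d = np.linalg.norm(X_scaled - km.cluster_centers_[km.labels_], axis=1)
    print("max distance to own center:", d.max())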

+1

Can't you just try sum |xi - yi| instead of (xi - yi)^2 in your code and see whether it makes a difference?
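If your k-means implementation has the distance inline, that swap is a one-line change in the assignment step. A hypothetical sketch (the function name assign_to_centers is mine, not from the original answer):

    import numpy as np

    def assign_to_centers(points, centers, manhattan=False):
        # distance from every point to every center
        diff = points[:, None, :] - centers[None, :, :]
        if manhattan:
            d = np.abs(diff).sum(axis=2)      # sum |xi - yi|
        else:
            d = (diff ** 2).sum(axis=2)       # sum (xi - yi)^2
        return d.argmin(axis=1)               # index of the nearest center

Compare the label vectors produced with manhattan=False and manhattan=True and see whether the clustering actually changes.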

I can't make a plot that would give some idea of the correctness of my algorithm.

Several possibilities:

By the way, scipy.spatial.cKDTree can easily give you, say, the 3 nearest neighbours of each point, with p = 2 (Euclidean) or p = 1 (Manhattan, L1) to look at. It is fast up to ~20 dimensions, and with an early cutoff it works even in 128 dimensions.
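A small sketch of that, with random data standing in for yours; cKDTree.query takes the number of neighbours k, the Minkowski order p, and an optional eps for approximate (early-cutoff) search:

    import numpy as np
    from scipy.spatial import cKDTree

    X = np.random.rand(300, 30)      # stand-in for the 300 x 30 data
    tree = cKDTree(X)

    # 3 nearest neighbours of every point (k=4 because each point
    # is its own nearest neighbour at distance 0)
    dist_l2, idx_l2 = tree.query(X, k=4, p=2)           # Euclidean
    dist_l1, idx_l1 = tree.query(X, k=4, p=1)           # Manhattan / L1
    dist_apx, _     = tree.query(X, k=4, p=2, eps=0.1)  # early cutoff, approximate

    print(dist_l2[:, 1:].mean(), dist_l1[:, 1:].mean())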


Added: I like cosine distance in high dimensions; see euclidean-distance-is-usually-not-good-for-sparse-data for why.
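For completeness, a sketch of cosine distance via SciPy (not part of the original answer): it compares directions rather than magnitudes, which is often what you want for sparse, high-dimensional data.

    import numpy as np
    from scipy.spatial.distance import cosine, pdist

    a, b = np.random.rand(100), np.random.rand(100)
    print(cosine(a, b))                         # 1 - cos(angle between a and b)

    X = np.random.rand(300, 30)
    print(pdist(X, metric="cosine").mean())     # all pairwise cosine distances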
+1

Source: https://habr.com/ru/post/1380881/

