Checking the output of a clustering algorithm

Is there an objective way to validate the output of a clustering algorithm?

I am using scikit-learn's affinity propagation clustering on a dataset of objects with many attributes. The distance matrix supplied to the clustering algorithm is a weighted difference of these attributes. I'm looking for a way to objectively validate tweaks to the distance weightings as reflected in the resulting clusters. The dataset is large and has enough attributes that manual examination of small examples is not a reasonable way to verify the produced clusters.
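
To make the setup concrete, here is a minimal sketch of what I mean, with random data and made-up attribute weights standing in for the real ones. Note that scikit-learn's AffinityPropagation with affinity='precomputed' expects similarities (larger = more alike), so the weighted distances are negated:

    import numpy as np
    from sklearn.cluster import AffinityPropagation

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))             # 200 objects, 5 attributes (placeholder data)
    w = np.array([1.0, 0.5, 2.0, 1.0, 0.3])   # per-attribute weights (placeholder values)

    # Weighted pairwise distances between all objects.
    diff = X[:, None, :] - X[None, :, :]
    D = np.sqrt((w * diff ** 2).sum(axis=-1))

    # Affinity propagation maximizes similarity, so pass negated distances.
    ap = AffinityPropagation(affinity="precomputed", random_state=0)
    labels = ap.fit_predict(-D)
    print(len(set(labels)), "clusters found")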

+4
2 answers

Yes:

Give the clusters to a domain expert and ask them to analyze whether the structure found by the algorithm is sensible. Not so much whether it is new, but whether it is sensible.

... and No:

There is no fair automatic evaluation, in the sense of one that takes into account the goal of unsupervised clustering: knowledge discovery, a.k.a. learning something new about your data.

There are two common ways to evaluate clusterings (a code sketch of both follows this list):

  • internal evaluation. That is, there is some property, such as the ratio of within-cluster variance to between-cluster variance, to be minimized. The problem is that it is usually fairly trivial to cheat, i.e. to construct a degenerate solution that scores very well. So this method must not be used to compare methods based on different assumptions; you cannot even compare different linkage types for hierarchical clustering this way.

  • external evaluation. You use a labeled dataset and score algorithms on how well they rediscover the existing knowledge. Sometimes this works quite well, so it is an accepted state of the art for evaluation. However, any supervised or semi-supervised method would of course score much better at this. As such, it is A) biased toward supervised methods, and B) actually runs completely counter to the idea of knowledge discovery: finding something you did not yet know.
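
A minimal sketch of both styles on toy data, using the silhouette coefficient as the internal score and the Adjusted Rand Index as the external score (common stand-ins, not the only options):

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score, silhouette_score

    X, y_true = make_blobs(n_samples=300, centers=4, random_state=0)
    labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

    # Internal evaluation: cohesion vs. separation, no labels needed.
    print("silhouette:", silhouette_score(X, labels))

    # External evaluation: agreement with known ("ground truth") labels.
    print("ARI:", adjusted_rand_score(y_true, labels))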

If you really want to use clustering - i.e. learn something about your data - you will at some point have to inspect the clusters, preferably with a completely independent method such as a domain expert. If the expert can tell you that, say, a user group identified by the clustering is a non-trivial group that has not yet been investigated closely, then you are a winner.

However, most people want a one-click (and one-score) evaluation, unfortunately.

Oh, and “clustering” is not really a machine learning task; there is no actual learning involved. To the machine learning community, it is the ugly duckling that nobody cares about.

+9

There is another way to evaluate clustering quality: compute a stability metric on folds, a bit like cross-validation for supervised models (a code sketch follows this list):

  • Split the dataset into 3 folds A, B and C. Compute two clusterings with your algorithm on A + B and A + C. Compute the Adjusted Rand Index or Adjusted Mutual Information of the two labelings on their intersection A, and treat this value as an estimate of the stability score of the algorithm.

  • Rinse and repeat: shuffle the data, split it into 3 other folds A', B' and C', and recompute a stability score.

  • Average the stability scores over 5 or 10 runs to get a rough estimate of the mean and standard error of the stability score.
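
Here is a sketch of this procedure, assuming KMeans as the clustering algorithm and ARI as the agreement metric; the fold layout, k and the number of runs are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.datasets import make_blobs
    from sklearn.metrics import adjusted_rand_score

    X, _ = make_blobs(n_samples=300, centers=4, random_state=0)

    def stability_score(X, n_clusters, n_runs=10, seed=0):
        rng = np.random.default_rng(seed)
        scores = []
        for _ in range(n_runs):
            idx = rng.permutation(len(X))
            A, B, C = np.array_split(idx, 3)  # three disjoint folds

            def cluster(rows):
                return KMeans(n_clusters=n_clusters, n_init=10,
                              random_state=0).fit_predict(X[rows])

            labels_ab = cluster(np.concatenate([A, B]))
            labels_ac = cluster(np.concatenate([A, C]))
            # Fold A comes first in both runs, so compare the labels there.
            scores.append(adjusted_rand_score(labels_ab[:len(A)],
                                              labels_ac[:len(A)]))
        scores = np.asarray(scores)
        return scores.mean(), scores.std() / np.sqrt(n_runs)

    mean, se = stability_score(X, n_clusters=4)
    print(f"stability: {mean:.3f} +/- {se:.3f}")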

As you can guess, this is a very compute-intensive evaluation method.

Whether this stability-based evaluation of clustering algorithms is really useful in practice, and when it can fail to produce a valid criterion for model selection, is still an open research area. Please refer to Clustering Stability: An Overview by Ulrike von Luxburg and the references therein for a survey of the state of those questions.

Note: it is important to use chance-adjusted metrics such as ARI or AMI if you want to use this strategy to select the best value of k in k-means, for example. Unadjusted metrics such as NMI and V-measure tend to arbitrarily favor models with higher k, as the toy example below illustrates.
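
A quick toy illustration of that bias: two purely random labelings with many clusters already score well on unadjusted NMI, while chance-adjusted AMI stays near zero:

    import numpy as np
    from sklearn.metrics import adjusted_mutual_info_score, normalized_mutual_info_score

    rng = np.random.default_rng(0)
    a = rng.integers(0, 50, size=500)  # 50 random "clusters"
    b = rng.integers(0, 50, size=500)
    print("NMI:", normalized_mutual_info_score(a, b))  # misleadingly high for pure noise
    print("AMI:", adjusted_mutual_info_score(a, b))    # close to 0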

+3

Source: https://habr.com/ru/post/1437279/

