Choosing a High-Dimensional Clustering Method?

If the data to cluster are literal points (either 2D (x, y) or 3D (x, y, z)), choosing a clustering method is quite intuitive. Since we can draw and visualize them, we know much better which clustering method is more suitable.

e.g. 1: If my 2D dataset has the shape shown below, I would know that K-means might not be the smart choice here, whereas DBSCAN seems to be the best idea.

[image: a 2D dataset on which K-means struggles but DBSCAN succeeds]

However, as the scikit-learn site puts it:

Although these examples give some intuition about algorithms, this intuition may not apply to very high dimensional data.

AFAIK, in most practical problems we don't have such simple data. Most likely, we have high-dimensional tuples that cannot be visualized like this.

e.g. 2: I want to cluster a dataset where each item is represented as a 4-D tuple <characteristic1, characteristic2, characteristic3, characteristic4>. I cannot visualize it in a coordinate system and observe its distribution as before, so I cannot say that DBSCAN is superior to K-means in this case.

So my question is:

How do I choose a suitable clustering method for such an "invisible" high-dimensional case?

+4
4 answers

High-dimensional clustering probably starts at around 10-20 dimensions for dense data, and above 1000 dimensions for sparse data (for example, text).

4 dimensions are not much of a problem, and can still be visualized; for example, using multiple 2D projections (or even 3D, using rotation), or using parallel coordinates. Here is a visualization of the 4-dimensional iris dataset using a scatter plot matrix.
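
A minimal sketch of such a scatter plot matrix in Python, using pandas and the iris data bundled with scikit-learn:

```python
# Scatter plot matrix: one 2D projection per pair of the 4 iris features;
# the diagonal shows each feature's histogram
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt

iris = load_iris(as_frame=True)
pd.plotting.scatter_matrix(iris.data, c=iris.target, figsize=(8, 8), diagonal="hist")
plt.show()
```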

However, the first thing you should still do is spend a lot of time on preprocessing and on finding an appropriate distance function.
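
For example, here is a minimal sketch (assuming scikit-learn and SciPy) of plugging a hand-picked distance function into DBSCAN via a precomputed distance matrix; the random data, metric, and parameters are placeholders:

```python
# Cluster with an explicitly chosen distance function via a precomputed matrix
import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))                 # stand-in for your preprocessed data

D = squareform(pdist(X, metric="cityblock"))  # swap in whatever distance fits your data
labels = DBSCAN(eps=1.5, min_samples=5, metric="precomputed").fit_predict(D)
```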

If you really need methods for high-dimensional data, have a look at subspace clustering and correlation clustering, for example:

  • Kriegel, Hans-Peter, Peer Kröger, and Arthur Zimek. "Clustering high-dimensional data: A survey on subspace clustering, pattern-based clustering, and correlation clustering." ACM Transactions on Knowledge Discovery from Data (TKDD) 3.1 (2009): 1.

The authors of that survey also publish a software framework that includes many of these advanced clustering methods (not just k-means, but also e.g. CASH, 4C, ERiC): ELKI

+6

There are at least two common approaches:

  • You can use some dimensionality reduction method to actually visualize the high-dimensional data; there are dozens of popular solutions, including (but not limited to):

    • PCA - principal component analysis
    • SOM - self-organizing maps
    • Sammon mapping
    • Autoencoder neural networks
    • KPCA - kernel principal component analysis
    • Isomap

    After that, one either goes back to the original space and uses a method that seems reasonable based on the observations in the reduced space, or performs the clustering in the reduced space itself (see the first sketch after this list). The first approach uses all the available information, but may be invalid due to distortions induced by the reduction process, while the second guarantees that your observations and choice are valid (as you reduce your problem to a nice 2D/3D one), but it loses a lot of information due to the transformation used.

  • You can try many different algorithms and choose the one with the best scores on some evaluation metric (many criteria for assessing clustering quality have been proposed); see the second sketch below. This is an expensive approach, but it has a lower bias (since dimensionality reduction itself alters the information through the transformation used).
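
A minimal sketch of the "cluster in the reduced space" variant, assuming scikit-learn; the random data and the DBSCAN parameters are placeholders you would tune on real data:

```python
# Reduce 4-D data to 2D with PCA, inspect it, then cluster in the reduced space
import numpy as np
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))              # stand-in for your 4-D tuples

X2 = PCA(n_components=2).fit_transform(X)  # project to 2D for inspection
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X2)

plt.scatter(X2[:, 0], X2[:, 1], c=labels)  # eyeball the resulting structure
plt.show()
```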
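
And a minimal sketch of the second approach, scoring several algorithms with the silhouette coefficient (just one of the many proposed criteria); the data and parameters are again placeholders:

```python
# Run several clustering algorithms and compare them by silhouette score
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, AgglomerativeClustering
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))   # stand-in for your 4-D tuples

candidates = {
    "k-means": KMeans(n_clusters=3, n_init=10),
    "agglomerative": AgglomerativeClustering(n_clusters=3),
    "dbscan": DBSCAN(eps=1.0, min_samples=5),
}
for name, algo in candidates.items():
    labels = algo.fit_predict(X)
    if len(set(labels)) > 1:    # silhouette needs at least two clusters
        print(name, silhouette_score(X, labels))
```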

+5

It is true that high-dimensional Euclidean data cannot easily be visualized, but it is not true that there are no visualization techniques for it.

In addition to that statement, I will add that with only four features (your dimensions) you can easily try the parallel coordinates visualization method. Or simply try multivariate data analysis with two features at a time (6 pairs in total) to figure out what kind of relationship exists between each pair (correlation and dependence in general). Or you can even use 3D space, three features at a time.
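
A minimal sketch of a parallel coordinates plot for a 4-feature dataset, using pandas and the iris data as a stand-in:

```python
# Parallel coordinates: each line is one sample, each vertical axis one feature
from sklearn.datasets import load_iris
import pandas as pd
import matplotlib.pyplot as plt

iris = load_iris(as_frame=True)
df = iris.data.copy()
df["class"] = iris.target_names[iris.target]  # column used to color the lines

pd.plotting.parallel_coordinates(df, "class")
plt.show()
```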

Then, how do you extract information from these visualizations? Well, it is not as easy as in Euclidean space, but the point is to visually spot whether the data clusters into groups (for example, near certain values on an axis in a parallel coordinates plot) and to reason about whether the data is somehow separable (for example, whether it forms regions like circles or separated lines in the scatterplots).

A small digression: the diagram you posted does not rank the power or capabilities of each algorithm given specific data distributions; it simply highlights the nature of some algorithms. For example, k-means can only separate convex and ellipsoidal regions (and keep in mind that convexity and ellipsoids exist in n dimensions as well). What I mean is, there is no rule that says: given the distributions pictured in that diagram, you must choose this or that clustering algorithm.

I suggest using a data mining toolbox that lets you explore and visualize the data (and transform it easily, since you can change its topology with transformations, projections, and reductions; check lejlot's answer for that), for example Weka (plus, you don't need to implement all the algorithms yourself).

Finally, I will point you to this resource on different measures of cluster quality and validity, so that you can compare the results obtained with different algorithms.

+2

I would also suggest soft subspace clustering, a fairly common approach nowadays, where attribute weights are added to find the most relevant features. You can use these weights to improve performance, for example in the BMU calculation with a weighted Euclidean distance.
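
A minimal sketch of such a feature-weighted Euclidean distance in a BMU computation; the codebook and the weight values are purely illustrative:

```python
# Best matching unit (BMU) search with per-feature attribute weights
import numpy as np

def weighted_bmu(x, units, weights):
    """Index of the unit closest to x under a weighted squared Euclidean distance."""
    d = ((units - x) ** 2 * weights).sum(axis=1)
    return int(np.argmin(d))

units = np.random.default_rng(0).normal(size=(10, 4))  # e.g. SOM codebook vectors
weights = np.array([0.4, 0.3, 0.2, 0.1])               # learned attribute weights
x = np.zeros(4)
print(weighted_bmu(x, units, weights))
```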

+1

Source: https://habr.com/ru/post/1502362/

