Large-scale clustering library, possibly with python bindings

I am trying to cluster some larger data, consisting of 50,000 measurement vectors of dimension 7. I am trying to create about 30-300 clusters for further processing.

I have tried the following clustering implementations, with no luck:

  • Pycluster.kcluster (gives only 1-2 non-empty clusters in my dataset)
  • scipy.cluster.hierarchy.fclusterdata (takes too long)
  • scipy.cluster.vq.kmeans (out of memory)
  • sklearn.cluster.hierarchical.Ward (takes too long)

Are there any other implementations that I might be missing?
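For context, the calls are roughly of the following shape (a sketch; the parameter values here are illustrative, not my exact settings):

 import numpy as np
 import Pycluster
 from scipy.cluster.vq import kmeans, whiten

 data = np.random.rand(50000, 7)          # stand-in for the 50,000 x 7 measurements

 # Pycluster k-means: returns cluster ids, within-cluster error, and how often
 # the best solution was found
 clusterid, error, nfound = Pycluster.kcluster(data, nclusters=30)

 # SciPy k-means on whitened data: returns the codebook (centroids) and distortion
 codebook, distortion = kmeans(whiten(data), 30)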

+6
4 answers

50,000 instances in 7 dimensions is not very large and should not kill any implementation.

Although it doesn't have Python bindings, try ELKI. The benchmark set they use on their homepage has 110,250 instances in 8 dimensions, and they apparently run k-means on it in 60 seconds and the much more advanced OPTICS in 350 seconds.
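ELKI is a Java command-line application, so one way to drive it from Python is to shell out to it. The class and parameter names below are assumptions based on ELKI's documented CLI and may differ between versions; check `java -jar elki.jar -help` for the exact ones:

 import subprocess

 # Run ELKI's k-means from the shell; the algorithm/parameter names below are
 # assumptions -- verify them against your ELKI version's -help output.
 subprocess.run([
     "java", "-jar", "elki.jar", "KDDCLIApplication",
     "-dbc.in", "measurements.csv",               # CSV/whitespace input file
     "-algorithm", "clustering.kmeans.KMeansLloyd",
     "-kmeans.k", "300",
     "-resulthandler", "ResultWriter",
     "-out", "elki_output",                       # directory for cluster assignments
 ], check=True)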

Avoid hierarchical clustering. It is really only suitable for small data sets. The way it is usually implemented, on distance matrices, is O(n^3), which is very bad for large data sets. So I am not surprised that those two time out for you.

DBSCAN and OPTICS, when implemented with index support, are O(n log n). When implemented naively, they are O(n^2). K-means is really fast, but often the results are not satisfactory (because it always splits down the middle). It should run in O(n * k * iter), and it usually converges in not too many iterations (iter << 100). But it only works with Euclidean distance, and it just does not work well with some data (high-dimensional, discrete, binary, clusters of different sizes, ...).
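For example, in scikit-learn you can force a tree-based index so that DBSCAN's neighborhood queries avoid the brute-force O(n^2) path (a sketch; eps and min_samples need tuning for your data):

 import numpy as np
 from sklearn.cluster import DBSCAN

 X = np.random.randn(50000, 7)

 # 'kd_tree' (or 'ball_tree') makes the eps-neighborhood queries use a spatial
 # index instead of a full pairwise distance computation.
 labels = DBSCAN(eps=0.5, min_samples=10, algorithm='kd_tree').fit_predict(X)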

+9

Since you're already trying scikit-learn: sklearn.cluster.KMeans should scale better than Ward, and it supports parallel fitting on multi-core machines. MiniBatchKMeans is better still, but it will not do random restarts for you.

 >>> import numpy as np
 >>> from sklearn.cluster import MiniBatchKMeans
 >>> X = np.random.randn(50000, 7)
 >>> %timeit MiniBatchKMeans(30).fit(X)
 1 loops, best of 3: 114 ms per loop
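If the lack of random restarts matters, a sketch of doing them by hand (continuing the session above) is to fit several times with different seeds and keep the model with the lowest inertia_:

 >>> best = min(
 ...     (MiniBatchKMeans(30, random_state=seed).fit(X) for seed in range(10)),
 ...     key=lambda km: km.inertia_)   # inertia_ = within-cluster sum of squares
 >>> labels = best.labels_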
+6

My milk package handles this problem easily:

 import milk
 import numpy as np
 data = np.random.rand(50000, 7)
 %timeit milk.kmeans(data, 300)
 1 loops, best of 3: 14.3 s per loop

I wonder whether you meant to write 500,000 data points, because 50 thousand points is not that much. If so, milk takes a while (~700 seconds) but still handles it well, since it does not allocate any memory apart from your data and the centroids.
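A minimal usage sketch (assuming the (assignments, centroids) return convention; check the milk docs for the exact signature):

 import milk
 import numpy as np

 data = np.random.rand(50000, 7)
 # assumed return convention: (cluster assignment per point, centroid array)
 cluster_ids, centroids = milk.kmeans(data, 300)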

+2

OpenCV has a k-means implementation, Kmeans2.

The expected running time is on the order of O(n**4). For an order-of-magnitude estimate, see how long it takes to cluster 1,000 points, then multiply that by roughly seven million (50**4, since the data set is 50 times larger).
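With the current cv2 Python bindings (the Kmeans2 name belongs to the legacy cv module), a rough usage sketch looks like this; the iteration count, attempts, and flags are illustrative:

 import cv2
 import numpy as np

 data = np.random.rand(50000, 7).astype(np.float32)   # cv2.kmeans requires float32
 criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.1)

 # attempts=5 reruns with different initial centers and keeps the best compactness
 compactness, labels, centers = cv2.kmeans(
     data, 30, None, criteria, 5, cv2.KMEANS_RANDOM_CENTERS)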

0

Source: https://habr.com/ru/post/918421/

