Large-scale clustering library, possibly with python bindings

I am trying to cluster some larger data, consisting of 50,000 measurement vectors of dimension 7. I am trying to create about 30-300 clusters for further processing.

I have tried the following clustering implementations, with no luck:

  • Pycluster.kcluster (gives only 1-2 non-empty clusters in my dataset)
  • scipy.cluster.hierarchy.fclusterdata (takes too long)
  • scipy.cluster.vq.kmeans (out of memory)
  • sklearn.cluster.hierarchical.Ward (takes too long)

Are there any other implementations that I might be missing?
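For context, the calls are roughly of the following shape (a sketch; the parameter values here are illustrative, not my exact settings):

 import numpy as np
 import Pycluster
 from scipy.cluster.vq import kmeans, whiten

 data = np.random.rand(50000, 7)          # stand-in for the 50,000 x 7 measurements

 # Pycluster k-means: returns cluster ids, within-cluster error, and how often
 # the best solution was found
 clusterid, error, nfound = Pycluster.kcluster(data, nclusters=30)

 # SciPy k-means on whitened data: returns the codebook (centroids) and distortion
 codebook, distortion = kmeans(whiten(data), 30)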

+6
4 answers

50,000 instances in 7 dimensions is not very large and should not kill any implementation.

Although it doesn't have Python bindings, try ELKI. The benchmark set they use on their homepage has 110,250 instances in 8 dimensions, and they apparently run k-means on it in 60 seconds and the much more advanced OPTICS in 350 seconds.
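ELKI is a Java command-line application, so one way to drive it from Python is to shell out to it. The class and parameter names below are assumptions based on ELKI's documented CLI and may differ between versions; check `java -jar elki.jar -help` for the exact ones:

 import subprocess

 # Run ELKI's k-means from the shell; the algorithm/parameter names below are
 # assumptions -- verify them against your ELKI version's -help output.
 subprocess.run([
     "java", "-jar", "elki.jar", "KDDCLIApplication",
     "-dbc.in", "measurements.csv",               # CSV/whitespace input file
     "-algorithm", "clustering.kmeans.KMeansLloyd",
     "-kmeans.k", "300",
     "-resulthandler", "ResultWriter",
     "-out", "elki_output",                       # directory for cluster assignments
 ], check=True)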

Avoid hierarchical clustering. It is really only suitable for small data sets. The way it is usually implemented, on distance matrices, is O(n^3), which is very bad for large data sets. So I am not surprised that those two time out for you.

DBSCAN and OPTICS, when implemented with index support, are O(n log n). When implemented naively, they are O(n^2). K-means is really fast, but often the results are not satisfactory (because it always splits down the middle). It should run in O(n * k * iter), and it usually converges in not too many iterations (iter << 100). But it only works with Euclidean distance, and it just does not work well with some data (high-dimensional, discrete, binary, clusters of different sizes, ...).
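For example, in scikit-learn you can force a tree-based index so that DBSCAN's neighborhood queries avoid the brute-force O(n^2) path (a sketch; eps and min_samples need tuning for your data):

 import numpy as np
 from sklearn.cluster import DBSCAN

 X = np.random.randn(50000, 7)

 # 'kd_tree' (or 'ball_tree') makes the eps-neighborhood queries use a spatial
 # index instead of a full pairwise distance computation.
 labels = DBSCAN(eps=0.5, min_samples=10, algorithm='kd_tree').fit_predict(X)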

+9

Since you're already trying scikit-learn: sklearn.cluster.KMeans should scale better than Ward, and it supports parallel fitting on multi-core machines. MiniBatchKMeans is better still, but it will not do random restarts for you.

 >>> import numpy as np
 >>> from sklearn.cluster import MiniBatchKMeans
 >>> X = np.random.randn(50000, 7)
 >>> %timeit MiniBatchKMeans(30).fit(X)
 1 loops, best of 3: 114 ms per loop
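If the lack of random restarts matters, a sketch of doing them by hand (continuing the session above) is to fit several times with different seeds and keep the model with the lowest inertia_:

 >>> best = min(
 ...     (MiniBatchKMeans(30, random_state=seed).fit(X) for seed in range(10)),
 ...     key=lambda km: km.inertia_)   # inertia_ = within-cluster sum of squares
 >>> labels = best.labels_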
+6

My milk package handles this problem easily:

 import milk
 import numpy as np
 data = np.random.rand(50000, 7)
 %timeit milk.kmeans(data, 300)
 1 loops, best of 3: 14.3 s per loop

I wonder whether you meant to write 500,000 data points, because 50 thousand points is not that much. If so, milk takes a while (~700 seconds) but still handles it well, since it does not allocate any memory apart from your data and the centroids.
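A minimal usage sketch (assuming the (assignments, centroids) return convention; check the milk docs for the exact signature):

 import milk
 import numpy as np

 data = np.random.rand(50000, 7)
 # assumed return convention: (cluster assignment per point, centroid array)
 cluster_ids, centroids = milk.kmeans(data, 300)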

+2

OpenCV has a k-means implementation, Kmeans2.

The expected running time is on the order of O(n**4). For an order-of-magnitude estimate, see how long it takes to cluster 1,000 points, then multiply that by roughly seven million (50**4, since the data set is 50 times larger).
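With the current cv2 Python bindings (the Kmeans2 name belongs to the legacy cv module), a rough usage sketch looks like this; the iteration count, attempts, and flags are illustrative:

 import cv2
 import numpy as np

 data = np.random.rand(50000, 7).astype(np.float32)   # cv2.kmeans requires float32
 criteria = (cv2.TERM_CRITERIA_EPS + cv2.TERM_CRITERIA_MAX_ITER, 100, 0.1)

 # attempts=5 reruns with different initial centers and keeps the best compactness
 compactness, labels, centers = cv2.kmeans(
     data, 30, None, criteria, 5, cv2.KMEANS_RANDOM_CENTERS)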

0

Source: https://habr.com/ru/post/918421/

