50,000 instances and 7 dimensions are not really big and should not kill an implementation.
Although it doesn't have Python bindings, give ELKI a try. The benchmark set they use on their homepage is 110,250 instances in 8 dimensions, and apparently they run k-means on it in 60 seconds, and the much more advanced OPTICS in 350 seconds.
Avoid hierarchical clustering. It is really only suitable for small data sets. The way it is commonly implemented, on full distance matrices, it is O(n^3) time and O(n^2) memory, which is very bad for large data sets. So I am not surprised that these two timed out for you.
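To make the memory side of that concrete, here is a minimal back-of-the-envelope sketch, assuming 50,000 points (as in your case) and float64 distances; the commented SciPy calls show where such a matrix would actually be built:

```python
# Why matrix-based hierarchical clustering blows up: the condensed pairwise
# distance matrix for n points needs n*(n-1)/2 entries.
n = 50_000
pairs = n * (n - 1) // 2            # ~1.25e9 distances
gigabytes = pairs * 8 / 1e9         # float64 -> roughly 10 GB for the matrix alone
print(f"{pairs:,} pairwise distances, ~{gigabytes:.0f} GB as float64")

# SciPy's linkage() works on exactly this condensed matrix, e.g.:
# from scipy.cluster.hierarchy import linkage
# from scipy.spatial.distance import pdist
# Z = linkage(pdist(X), method="average")   # feasible only for small X
```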
DBSCAN and OPTICS, when implemented with index support, are O(n log n). When implemented naively, they are O(n^2).
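If you stay with scikit-learn, the indexed versus naive behaviour is controlled by the `algorithm` parameter of `DBSCAN`. A minimal sketch; the random data is a stand-in for yours, and `eps`/`min_samples` are placeholder values you would have to tune:

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 7))    # stand-in for the real 50k x 7 data

# 'kd_tree' (or 'ball_tree') uses an index for the neighborhood queries;
# 'brute' falls back to naive O(n^2) pairwise distance computation.
labels = DBSCAN(eps=0.5, min_samples=10, algorithm="kd_tree").fit_predict(X)
print("clusters found:", len(set(labels)) - (1 if -1 in labels else 0))
```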
K-means is indeed fast, but often the results are not satisfactory (because it always splits in the middle). It should run in O(n * k * iter), and it usually converges in not too many iterations (iter << 100). But it only works with Euclidean distance, and it simply does not work well on some data (high-dimensional, discrete, binary, clusters of different sizes, ...).
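For comparison, a plain k-means run in scikit-learn on data of this size; the cluster count is an arbitrary placeholder, and again the random data stands in for yours:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 7))    # stand-in for the real data

# O(n * k * iter): with k=10 and a handful of iterations this is cheap.
km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
print("iterations until convergence:", km.n_iter_)   # typically well below 100
print("inertia:", km.inertia_)
```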