Many clustering methods, including scipy.cluster, start by sorting all the pairwise distances: about 60 million in your case, which is not too big.
How long does that step take for you?
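To get a feel for the timing, here is a minimal sketch that times `pdist` and a full sort of the pair distances. It assumes uniformly random points as a stand-in for your data; the function name `time_pair_distances` and the parameters `n`, `dim`, `seed` are made up for illustration.

```python
import time
import numpy as np
from scipy.spatial.distance import pdist

def time_pair_distances(n, dim=3, seed=0):
    """Time pdist and a full sort of the n*(n-1)/2 pair distances.
    Returns (number of pairs, pdist seconds, sort seconds)."""
    rng = np.random.default_rng(seed)
    pts = rng.random((n, dim))      # random stand-in for real data (assumption)
    t0 = time.perf_counter()
    Y = pdist(pts)                  # condensed vector of pairwise distances
    t1 = time.perf_counter()
    Y.sort()                        # in-place sort of all pair distances
    t2 = time.perf_counter()
    return len(Y), t1 - t0, t2 - t1

if __name__ == "__main__":
    npairs, t_pdist, t_sort = time_pair_distances(2000)
    print("%d pairs: pdist %.2f s, sort %.2f s" % (npairs, t_pdist, t_sort))
```

With n around 11000 this gives roughly 60 million pairs, which as float64 take about 0.5 GB, so start small and scale up.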
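import numpy as np
import scipy.spatial.distance
import scipy.cluster.hierarchy as hier
import pylab as pl

def fcluster( pts, ncluster, method="average", criterion="maxclust" ):
    """ -> (pts, Y pdist, Z linkage, T fcluster)
        ncluster = n1 + n2 + ... (including n1 singletons)
        av cluster size = len(pts) / ncluster
    """
    pts = np.asarray(pts)
    Y = scipy.spatial.distance.pdist( pts )  # condensed vector of N*(N-1)/2 pair distances
    Z = hier.linkage( Y, method )  # hierarchical linkage tree
    T = hier.fcluster( Z, ncluster, criterion=criterion )  # flat cluster label per point
    return (pts, Y, Z, T)

# usage: pts is your N x dim array of points; ncluster=10 is only an example value
pts, Y, Z, T = fcluster( pts, ncluster=10 )
hier.dendrogram( Z )
pl.show()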
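The flat labels from `hier.fcluster` can be grouped into per-cluster index lists, for example to check cluster sizes and count singletons. This is only a sketch on synthetic data: the helper name `cluster_lists` and the two-blob test points are assumptions, not part of the code above.

```python
import numpy as np
import scipy.cluster.hierarchy as hier
from scipy.spatial.distance import pdist

def cluster_lists(T):
    """Group point indices by flat cluster label: {label: [indices]}."""
    lists = {}
    for i, label in enumerate(T):
        lists.setdefault(int(label), []).append(i)
    return lists

rng = np.random.default_rng(0)
# two well-separated synthetic blobs of 20 points each (illustration only)
pts = np.vstack([rng.normal(0, 0.1, (20, 2)),
                 rng.normal(5, 0.1, (20, 2))])

Z = hier.linkage(pdist(pts), "average")
T = hier.fcluster(Z, 2, criterion="maxclust")   # at most 2 flat clusters

lists = cluster_lists(T)
sizes = sorted(len(v) for v in lists.values())
nsingle = sum(1 for v in lists.values() if len(v) == 1)
print("cluster sizes:", sizes, " singletons:", nsingle)
```

On real data, a large number of singletons for a given `ncluster` is a hint that the average cluster size `len(pts) / ncluster` is misleading.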
See also "How to set up the matrix and graph" on SO in March, with a partial answer.
denis