I cannot seem to find simple enough tutorials or descriptions of clustering in SciPy, so I will try to explain my problem:
I am trying to cluster documents (hierarchical agglomerative clustering). I have created a vector for each document and computed a symmetric distance matrix. The vector_list variable contains (very long) vectors representing each document. The order of this list of vectors is the same as my list of input documents, so that I (hopefully) can match the clustering results to the corresponding documents.
distances = distance.cdist(vector_list, vector_list, 'euclidean')
This gives a matrix whose diagonal holds the distance from each document to itself (always 0):
[0 5 4] [5 0 4] [5 4 0]
I pass this distance matrix to SciPy's linkage() function.
clusters = hier.linkage(distances, method='centroid', metric='euclidean')
This returns something I do not fully understand, although its type is numpy.ndarray. According to the docs, I can pass it to fcluster to get "flat clusters". I use half the maximum distance in the distance matrix as the threshold.
idx = hier.fcluster(clusters, 0.5*distances.max(), 'distance')
This also returns a numpy.ndarray, which again makes no sense to me. An example is [6 3 1 7 1 8 9 4 5 2]
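A hedged note on the linkage step: as far as I can tell from the SciPy docs, linkage expects either a condensed (1-D) distance matrix, such as the output of pdist, or the raw observation vectors. A square matrix like the cdist output would be read as a set of observation vectors rather than as distances. The sketch below assumes that reading; the random data, the seed, and the 10 x 50 shape are made-up stand-ins for the real vector_list.

```python
import numpy as np
from scipy.cluster import hierarchy as hier
from scipy.spatial import distance

# Hypothetical stand-in for the real document vectors: 10 documents, 50 features.
rng = np.random.default_rng(0)
vector_list = rng.random((10, 50))

# Condensed (1-D) pairwise distances, the form linkage accepts as distances.
condensed = distance.pdist(vector_list, 'euclidean')
clusters = hier.linkage(condensed, method='centroid')

# Alternative: pass the raw vectors and let linkage compute the distances.
clusters_from_vectors = hier.linkage(vector_list, method='centroid', metric='euclidean')

# The result has one row per merge: n - 1 rows, 4 columns.
# Columns: index of first cluster, index of second cluster,
# distance between them, number of original observations in the new cluster.
print(clusters.shape)    # (9, 4)
print(clusters[-1, 3])   # 10.0, the final merge contains every document
```

Indices below n in the first two columns refer to the original documents; an index of n + i refers to the cluster formed in row i.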
So my question is: what do I get from the linkage and fcluster functions , and how can I go from there and back to my document that I created in the first place to see if clusters make sense? Am I doing it right?
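For the step of going back to the documents, here is a minimal sketch, assuming that fcluster returns one cluster label per observation in the same order as the input vectors (which is how I read the docs). The documents list, the tiny 2-D vectors, and the threshold of 1.0 are hypothetical and chosen only so that the grouping is easy to check by eye.

```python
from collections import defaultdict

import numpy as np
from scipy.cluster import hierarchy as hier

# Hypothetical documents and vectors, kept in the same order.
documents = ["doc_a", "doc_b", "doc_c", "doc_d"]
vector_list = np.array([
    [0.0, 0.0],
    [0.1, 0.0],
    [5.0, 5.0],
    [5.1, 5.0],
])

# Build the merge tree from the raw vectors, then cut it at distance 1.0.
clusters = hier.linkage(vector_list, method='centroid', metric='euclidean')
idx = hier.fcluster(clusters, 1.0, 'distance')

# idx[i] is the flat cluster label of documents[i]; group documents by label.
groups = defaultdict(list)
for doc, label in zip(documents, idx):
    groups[int(label)].append(doc)

for label, docs in sorted(groups.items()):
    print(label, docs)
```

Under this reading, an output such as [6 3 1 7 1 8 9 4 5 2] would mean that the third and fifth documents share cluster 1 and every other document sits in a cluster of its own.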