Clustering with scipy via a distance matrix: how to get back to the original objects

I cannot seem to find simple enough tutorials or descriptions of clustering in scipy, so I will try to explain my problem:

I am trying to cluster documents (hierarchical agglomerative clustering). I have created a vector for each document and built a symmetric distance matrix. vector_list contains (very long) vectors representing each document. The order of this list of vectors is the same as that of my list of input documents, so that I can (hopefully) match the clustering results to the corresponding documents.

distances = distance.cdist(vector_list, vector_list, 'euclidean') 

This gives a matrix whose diagonal holds the distance from each document to itself (always 0):

 [0 5 4]
 [5 0 4]
 [5 4 0]

I pass this distance matrix to scipy's linkage() function.

 clusters = hier.linkage(distances, method='centroid', metric='euclidean') 

This returns something I do not fully understand, but its type is numpy.ndarray. According to the docs, I can pass it on to fcluster to get "flat clusters". I use half the maximum distance in the distance matrix as the threshold.

 idx = hier.fcluster(clusters, 0.5*distances.max(), 'distance')

This also returns a numpy.ndarray, which again makes little sense to me. An example is [6 3 1 7 1 8 9 4 5 2]

So my question is: what do I get from the linkage and fcluster functions, and how can I go from there back to the documents I started with, to see whether the clusters make sense? Am I doing this right?

1 answer

First, you don't need to go through the whole process with cdist and linkage if you use fclusterdata instead of fcluster; you can feed that function an (n_documents, n_features) array of term counts, tf-idf values, or whatever your features are.
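As a rough illustration of that shortcut, here is a minimal sketch that calls fclusterdata directly on a feature matrix. The four toy vectors and the threshold of 5.0 are assumptions made up for this example, not values from the question; the method and metric are the ones the question uses.

```python
import numpy as np
from scipy.cluster.hierarchy import fclusterdata

# Hypothetical feature matrix: 4 documents, 2 features each (made-up values).
vectors = np.array([
    [0.0, 0.0],
    [0.0, 1.0],
    [10.0, 10.0],
    [10.0, 11.0],
])

# Cut the hierarchy at distance 5.0 (a threshold picked by hand for this toy data).
labels = fclusterdata(vectors, t=5.0, criterion='distance',
                      metric='euclidean', method='centroid')

# labels[i] is the flat-cluster number assigned to vectors[i].
print(labels)
```

With these toy values, the first two rows should end up sharing one label and the last two rows another, since each pair is 1 unit apart while the pairs are far from each other.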

The output from fclusterdata is the same as that of fcluster: an array T such that "T[i] is the number of the flat cluster to which the original observation i belongs." I.e., the cluster.hierarchy module flattens the clustering according to the threshold you set, here 0.5*distances.max(). In your case, the third and fifth documents are grouped together, while all the others form clusters of their own, so you may want to set the threshold higher or use a different criterion.
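To get from T back to the documents, one option is to pair each label with the document at the same index. The sketch below uses the label array quoted in the question; the document names are hypothetical placeholders standing in for the real input list, which is assumed to be in the same order as the vectors.

```python
from collections import defaultdict

# Labels quoted in the question; labels[i] belongs to the i-th document.
labels = [6, 3, 1, 7, 1, 8, 9, 4, 5, 2]

# Hypothetical placeholder names; in practice this is the original document list.
documents = [f"doc{i}" for i in range(len(labels))]

# Collect the documents that share a flat-cluster number.
groups = defaultdict(list)
for doc, label in zip(documents, labels):
    groups[label].append(doc)

for label in sorted(groups):
    print(label, groups[label])
```

Cluster 1 then holds doc2 and doc4, i.e. the third and fifth documents, and every other cluster holds a single document.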


Source: https://habr.com/ru/post/899047/

