Scikit-learn Agglomerative Clustering Connectivity Matrix

I am trying to perform constrained clustering using sklearn's agglomerative clustering. To make the algorithm constrained, it requests a “connectivity matrix”. It is described as:

The connectivity constraints are imposed via a connectivity matrix: a scipy sparse matrix that has elements only at the intersection of a row and a column with indices of the dataset that should be connected. This matrix can be constructed from a-priori information: for instance, you may wish to cluster web pages by only merging pages with a link pointing from one to another.

I have a list of observation pairs that I want the algorithm to force into the same cluster. I can convert this list to a scipy sparse matrix (either coo or csr), but the resulting clusters do not respect the constraints.

Some data:

import numpy as np
import scipy as sp
import pandas as pd
import scipy.sparse as ss
from sklearn.cluster import AgglomerativeClustering


# unique ids 
ids = np.arange(10)

# Pairs that should belong to the same cluster
mustLink = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])

# Features for training the model
data = pd.DataFrame([
[.0873,-1.619,-1.343],
[0.697456, 0.410943, 0.804333],
[-1.295829, -0.709441, -0.376771],
[-0.404985, -0.107366, 0.875791],
[-0.404985, -0.107366,  0.875791],
[-0.515996, 0.731980, -1.569586],
[1.024580,  0.409148, 0.149408],
[-0.074604, 1.269414, 0.115744],
[-0.006706, 2.097276, 0.681819],
[-0.432196, 1.249149,-1.159271]])

Convert pairs to a “connectivity matrix”:

# Blank matrix in LIL format, which supports efficient item assignment
# (dtype must be passed by keyword; the second positional argument is shape)
sm = ss.lil_matrix((len(ids), len(ids)), dtype=np.int32)
# Insert 1 for connected pairs, on both sides of the diagonal
for a, b in mustLink[['A', 'B']].to_numpy():
    sm[a, b] = 1
    sm[b, a] = 1
sm.setdiag(1)  # add diagonals
sm = sm.tocoo()  # convert to coo format

Fit the agglomerative clustering model:

m = AgglomerativeClustering(n_clusters=6, connectivity=sm)
out = m.fit_predict(X=data)

Warning:

UserWarning: the number of connected components of the connectivity matrix is 7 > 1. Completing it to avoid stopping the tree early.
  connectivity, n_components = _fix_connectivity(X, connectivity)
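The component count reported in the warning can be reproduced directly. What follows is a minimal sketch, assuming the same three must-link pairs over ten observations as in the example data; it relies on `scipy.sparse.csgraph.connected_components`, and the variable names (`pairs`, `conn`, `component_ids`) are illustrative rather than taken from the question.

```python
import numpy as np
import scipy.sparse as ss
from scipy.sparse.csgraph import connected_components

n = 10  # number of observations, as in the example data
pairs = np.array([[1, 2], [1, 3], [4, 6]])  # must-link pairs

# Symmetric adjacency matrix with ones on the diagonal
rows = np.concatenate([pairs[:, 0], pairs[:, 1], np.arange(n)])
cols = np.concatenate([pairs[:, 1], pairs[:, 0], np.arange(n)])
conn = ss.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()

# Count the connected components of the constraint graph
n_components, component_ids = connected_components(conn, directed=False)
print(n_components)   # 7: groups {1, 2, 3} and {4, 6}, plus five isolated points
print(component_ids)  # observations 1, 2, 3 share an id, as do 4 and 6
```

Every observation without a link forms its own component, which is why a matrix containing only a few must-link pairs is far from connected.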

In addition to the ominous warning, the pairs I hoped would belong to the same cluster do not.
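Whether each pair actually shares a label can be checked programmatically rather than by eye. The sketch below rebuilds the example data and connectivity matrix so that it runs on its own; the names `pairs`, `conn`, and `same_cluster` are illustrative, and the fit is expected to emit the same UserWarning as above.

```python
import numpy as np
import scipy.sparse as ss
from sklearn.cluster import AgglomerativeClustering

# Feature rows copied from the example data
data = np.array([
    [0.0873, -1.619, -1.343],
    [0.697456, 0.410943, 0.804333],
    [-1.295829, -0.709441, -0.376771],
    [-0.404985, -0.107366, 0.875791],
    [-0.404985, -0.107366, 0.875791],
    [-0.515996, 0.731980, -1.569586],
    [1.024580, 0.409148, 0.149408],
    [-0.074604, 1.269414, 0.115744],
    [-0.006706, 2.097276, 0.681819],
    [-0.432196, 1.249149, -1.159271],
])
pairs = np.array([[1, 2], [1, 3], [4, 6]])  # must-link pairs
n = len(data)

# Symmetric connectivity matrix with ones on the diagonal
rows = np.concatenate([pairs[:, 0], pairs[:, 1], np.arange(n)])
cols = np.concatenate([pairs[:, 1], pairs[:, 0], np.arange(n)])
conn = ss.coo_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, n)).tocsr()

labels = AgglomerativeClustering(n_clusters=6, connectivity=conn).fit_predict(data)

# True where both members of a pair received the same cluster label
same_cluster = labels[pairs[:, 0]] == labels[pairs[:, 1]]
for (a, b), ok in zip(pairs, same_cluster):
    print(f"pair ({a}, {b}): same cluster = {ok}")
```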

, sklearn distance ( )?


sklearn.cluster.AgglomerativeClustering , . , , . "" , , (. ).

, , , , , .

:

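UserWarning: the number of connected components of the connectivity matrix is 7 > 1. Completing it to avoid stopping the tree early.
  connectivity, n_components = _fix_connectivity(X, connectivity)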

, 7 , , 1, . sklearn "" ( , ), .

. , , .
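One possible way to guarantee that must-link pairs end up together, independent of the `connectivity` argument, is to collapse each must-link group into a single point before clustering and then copy the group's label back to its members. This is a swapped-in technique sketched under assumptions, not something scikit-learn provides or the text above prescribes: the helper `cluster_with_must_links` is hypothetical, each group is represented by the plain mean of its members, and the input data is random placeholder data.

```python
import numpy as np
import scipy.sparse as ss
from scipy.sparse.csgraph import connected_components
from sklearn.cluster import AgglomerativeClustering


def cluster_with_must_links(X, pairs, n_clusters):
    """Hypothetical helper: cluster must-link groups as single points."""
    X = np.asarray(X, dtype=float)
    pairs = np.asarray(pairs)
    n = len(X)

    # Must-link pairs as graph edges; each connected component is one group
    graph = ss.coo_matrix(
        (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])), shape=(n, n)
    ).tocsr()
    n_groups, group_ids = connected_components(graph, directed=False)

    # Represent every group by the mean of its members (an assumption)
    sums = np.zeros((n_groups, X.shape[1]))
    np.add.at(sums, group_ids, X)
    counts = np.bincount(group_ids, minlength=n_groups)
    centroids = sums / counts[:, None]

    # Cluster the groups, then propagate each group's label to its members
    group_labels = AgglomerativeClustering(n_clusters=n_clusters).fit_predict(centroids)
    return group_labels[group_ids]


# Placeholder data: 10 random observations with 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
pairs = [[1, 2], [1, 3], [4, 6]]

labels = cluster_with_must_links(X, pairs, n_clusters=6)
print(labels)
```

The number of groups must be at least `n_clusters` (here 7 groups for 6 clusters). Because a group's mean carries no information about how many members it has, the merge order can differ from what Ward linkage would produce on the full data.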


Source: https://habr.com/ru/post/1672404/

