I am trying to perform constrained clustering using sklearn's AgglomerativeClustering. To constrain the algorithm, it accepts a “connectivity matrix”. The documentation describes it as:
Connectivity constraints are imposed via a connectivity matrix: a scipy sparse matrix that has elements only at the intersection of a row and a column with indices of the dataset that should be connected. This matrix can be constructed from a-priori information: for instance, you may wish to cluster web pages by only merging pages with a link pointing from one to another.
I have a list of observation pairs that I want the algorithm to keep in the same cluster. I can convert this list to a scipy sparse matrix (either coo or csr), but the resulting clusters do not respect the constraints.
Some data:
import numpy as np
import scipy as sp
import pandas as pd
import scipy.sparse as ss
from sklearn.cluster import AgglomerativeClustering
# unique ids
ids = np.arange(10)
# Pairs that should belong to the same cluster
mustLink = pd.DataFrame([[1, 2], [1, 3], [4, 6]], columns=['A', 'B'])
# Features for training the model
data = pd.DataFrame([
[.0873,-1.619,-1.343],
[0.697456, 0.410943, 0.804333],
[-1.295829, -0.709441, -0.376771],
[-0.404985, -0.107366, 0.875791],
[-0.404985, -0.107366, 0.875791],
[-0.515996, 0.731980, -1.569586],
[1.024580, 0.409148, 0.149408],
[-0.074604, 1.269414, 0.115744],
[-0.006706, 2.097276, 0.681819],
[-0.432196, 1.249149,-1.159271]])
Convert pairs to a “connectivity matrix”:
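# empty sparse matrix with one row and column per observation
sm = ss.coo_matrix((len(ids), len(ids)), dtype=np.int32).tocsr()
# mark each must-link pair in both directions so the matrix is symmetric
for i in np.arange(len(mustLink)):
    sm[mustLink.loc[i, 'A'], mustLink.loc[i, 'B']] = 1
    sm[mustLink.loc[i, 'B'], mustLink.loc[i, 'A']] = 1
# connect every observation to itself
for i in np.arange(sm.shape[0]):
    sm[i, i] = 1
sm = sm.tocoo()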
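For comparison, the same matrix can be assembled in one step instead of element by element. This is only a sketch: the helper name `build_connectivity` is made up for illustration, and it assumes the pairs are given as integer row positions.

```python
import numpy as np
import scipy.sparse as ss

def build_connectivity(pairs, n_samples):
    """Hypothetical helper: symmetric CSR matrix with 1s on the diagonal
    and at every (a, b) / (b, a) position listed in `pairs`."""
    pairs = np.asarray(pairs)
    diag = np.arange(n_samples)
    rows = np.concatenate([pairs[:, 0], pairs[:, 1], diag])
    cols = np.concatenate([pairs[:, 1], pairs[:, 0], diag])
    vals = np.ones(len(rows), dtype=np.int32)
    conn = ss.coo_matrix((vals, (rows, cols)),
                         shape=(n_samples, n_samples)).tocsr()
    # Converting to CSR sums duplicate entries, so reset stored values to 1.
    conn.data[:] = 1
    return conn

conn = build_connectivity([[1, 2], [1, 3], [4, 6]], n_samples=10)
```

Building from coordinate arrays avoids assigning into a CSR matrix one element at a time, which scipy flags as inefficient.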
Fit the agglomerative clustering model:
m = AgglomerativeClustering(n_clusters=6, connectivity=sm)
out = m.fit_predict(X=data)
Warning:
UserWarning: the number of connected components of the connectivity matrix is 7 > 1. Completing it to avoid stopping the tree early.
connectivity, n_components = _fix_connectivity(X, connectivity)
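The figure of 7 in the warning can be reproduced by counting the connected components of the graph defined by the pairs. A minimal sketch, assuming the same three pairs and ten observations as above (the variable names are illustrative):

```python
import numpy as np
import scipy.sparse as ss
from scipy.sparse.csgraph import connected_components

pairs = np.array([[1, 2], [1, 3], [4, 6]])
n_samples = 10

# One edge per must-link pair; directed=False treats each edge as two-way.
graph = ss.coo_matrix(
    (np.ones(len(pairs)), (pairs[:, 0], pairs[:, 1])),
    shape=(n_samples, n_samples),
)
n_components, component_labels = connected_components(graph, directed=False)
print(n_components)  # 7
```

The seven components are {1, 2, 3}, {4, 6}, and the five observations that appear in no pair, each on its own.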
In addition to the ominous warning, the pairs that I wanted in the same cluster do not end up together.
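A quick way to see which pairs were split is to compare the labels of each pair. The helper `violated_pairs` and the label vector below are made up for illustration; in practice the labels would be the `out` array returned by `fit_predict`.

```python
import numpy as np

def violated_pairs(labels, pairs):
    """Hypothetical helper: return the pairs whose members got different labels."""
    labels = np.asarray(labels)
    return [(int(a), int(b)) for a, b in pairs if labels[a] != labels[b]]

# Made-up labels for ten observations, not actual clustering output.
example_labels = [0, 1, 1, 2, 3, 4, 3, 5, 5, 4]
broken = violated_pairs(example_labels, [(1, 2), (1, 3), (4, 6)])
print(broken)  # [(1, 3)]
```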
Is there a way to enforce these must-link pairs in sklearn, perhaps through the distance it uses?