Silhouette coefficient in Python with sklearn

I am having trouble calculating the silhouette coefficient in Python using sklearn. Here is my code:

    from sklearn import datasets
    from sklearn.metrics import *
    iris = datasets.load_iris()
    X = pd.DataFrame(iris.data, columns = col)
    y = pd.DataFrame(iris.target,columns = ['cluster'])
    s = silhouette_score(X, y, metric='euclidean',sample_size=int(50))

I get an error message:

 IndexError: indices are out-of-bounds 

I want to use the sample_size parameter because, on very large datasets, the silhouette score takes too long to compute. Does anyone know how to make this option work?

Full trace:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-72-70ff40842503> in <module>()
          4 X = pd.DataFrame(iris.data, columns = col)
          5 y = pd.DataFrame(iris.target,columns = ['cluster'])
    ----> 6 s = silhouette_score(X, y, metric='euclidean',sample_size=50)

    /usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
         81             X, labels = X[indices].T[indices].T, labels[indices]
         82         else:
    ---> 83             X, labels = X[indices], labels[indices]
         84     return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
         85

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
       1993         if isinstance(key, (np.ndarray, list)):
       1994             # either boolean or fancy integer index
    -> 1995             return self._getitem_array(key)
       1996         elif isinstance(key, DataFrame):
       1997             return self._getitem_frame(key)

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_array(self, key)
       2030         else:
       2031             indexer = self.ix._convert_to_indexer(key, axis=1)
    -> 2032             return self.take(indexer, axis=1, convert=True)
       2033
       2034     def _getitem_multilevel(self, key):

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in take(self, indices, axis, convert)
       2981         if convert:
       2982             axis = self._get_axis_number(axis)
    -> 2983             indices = _maybe_convert_indices(indices, len(self._get_axis(axis)))
       2984
       2985         if self._is_mixed_type:

    /usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _maybe_convert_indices(indices, n)
       1038     mask = (indices>=n) | (indices<0)
       1039     if mask.any():
    -> 1040         raise IndexError("indices are out-of-bounds")
       1041     return indices
       1042

    IndexError: indices are out-of-bounds
1 answer

silhouette_score expects regular NumPy arrays as input. Why wrap the arrays in data frames?

    >>> silhouette_score(iris.data, iris.target, sample_size=50)
    0.52999903616584543

From the trace, you can see that the code does fancy indexing (subsampling) on the first axis. By default, indexing a data frame with an array selects columns, not rows, hence the error you are observing.
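If the data already lives in data frames, a minimal sketch along the following lines shows the column-versus-row behaviour and one way around it. It assumes a pandas version that provides DataFrame.to_numpy(); the column names taken from iris.feature_names, the example indices, and random_state=0 are illustrative choices, not part of the original question. Newer scikit-learn releases may convert data frames on their own, but passing arrays explicitly does not rely on that.

```python
import numpy as np
import pandas as pd
from sklearn import datasets
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()
# Assumption: use the dataset's own feature names as column labels.
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=["cluster"])

# Indexing a DataFrame with an integer array looks the values up among the
# COLUMNS, so row positions such as 100 cannot match any of the 4 columns.
indices = np.array([0, 50, 100])
try:
    X[indices]
    column_lookup_failed = False
except (KeyError, IndexError):
    # Older pandas raises IndexError, newer versions raise KeyError.
    column_lookup_failed = True

# Positional row selection on a DataFrame needs .iloc instead.
rows = X.iloc[indices]

# Converting to plain NumPy arrays sidesteps the issue entirely:
# array[indices] selects rows, which is what the subsampling expects.
X_arr = X.to_numpy()
labels = y["cluster"].to_numpy()  # 1-D label vector, not a one-column frame

score = silhouette_score(
    X_arr, labels, metric="euclidean", sample_size=50, random_state=0
)
print(score)
```

Fixing random_state makes the subsample, and therefore the score, repeatable between runs; without it each call draws a different sample of 50 points.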


Source: https://habr.com/ru/post/959317/

