Silhouette coefficient in python with sklearn

Question

I am having trouble computing the silhouette coefficient in Python with sklearn. Here is my code:

    from sklearn import datasets
    from sklearn.metrics import *
    iris = datasets.load_iris()
    X = pd.DataFrame(iris.data, columns = col)
    y = pd.DataFrame(iris.target,columns = ['cluster'])
    s = silhouette_score(X, y, metric='euclidean',sample_size=int(50))

I get an error message:

 IndexError: indices are out-of-bounds

I want to use the sample_size parameter because, on very large datasets, the silhouette takes too long to compute. Does anyone know how to make this option work?

Full trace:

    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-72-70ff40842503> in <module>()
          4 X = pd.DataFrame(iris.data, columns = col)
          5 y = pd.DataFrame(iris.target,columns = ['cluster'])
    ----> 6 s = silhouette_score(X, y, metric='euclidean',sample_size=50)

    /usr/local/lib/python2.7/dist-packages/sklearn/metrics/cluster/unsupervised.pyc in silhouette_score(X, labels, metric, sample_size, random_state, **kwds)
         81             X, labels = X[indices].T[indices].T, labels[indices]
         82         else:
    ---> 83             X, labels = X[indices], labels[indices]
         84     return np.mean(silhouette_samples(X, labels, metric=metric, **kwds))
         85

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in __getitem__(self, key)
       1993         if isinstance(key, (np.ndarray, list)):
       1994             # either boolean or fancy integer index
    -> 1995             return self._getitem_array(key)
       1996         elif isinstance(key, DataFrame):
       1997             return self._getitem_frame(key)

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in _getitem_array(self, key)
       2030         else:
       2031             indexer = self.ix._convert_to_indexer(key, axis=1)
    -> 2032             return self.take(indexer, axis=1, convert=True)
       2033
       2034     def _getitem_multilevel(self, key):

    /usr/local/lib/python2.7/dist-packages/pandas/core/frame.pyc in take(self, indices, axis, convert)
       2981         if convert:
       2982             axis = self._get_axis_number(axis)
    -> 2983             indices = _maybe_convert_indices(indices, len(self._get_axis(axis)))
       2984
       2985         if self._is_mixed_type:

    /usr/local/lib/python2.7/dist-packages/pandas/core/indexing.pyc in _maybe_convert_indices(indices, n)
       1038     mask = (indices>=n) | (indices<0)
       1039     if mask.any():
    -> 1040         raise IndexError("indices are out-of-bounds")
       1041     return indices
       1042

    IndexError: indices are out-of-bounds

python scikit-learn cluster-analysis

Scratch Dec 04 '13 at 11:20

1 answer

ogrisel · Accepted Answer · 2013-12-04T18:26:46+0000

silhouette_score expects regular numpy arrays as input. Why wrap the arrays in data frames?

    >>> silhouette_score(iris.data, iris.target, sample_size=50)
    0.52999903616584543
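If the data already lives in data frames, as in the question, one option is to convert it back to plain arrays just before scoring. The following is a minimal sketch under a few assumptions: the column names come from `iris.feature_names` (the question's `col` is not shown), `to_numpy()` is available in the installed pandas, and `random_state=0` is added only to make the subsample repeatable.

```python
import pandas as pd
from sklearn import datasets
from sklearn.metrics import silhouette_score

iris = datasets.load_iris()

# Frames mirroring the question's setup; the column names are an assumption.
X = pd.DataFrame(iris.data, columns=iris.feature_names)
y = pd.DataFrame(iris.target, columns=["cluster"])

# Convert to plain numpy arrays; the labels must be one-dimensional.
X_arr = X.to_numpy()
labels = y["cluster"].to_numpy()

# Score a random subsample of 50 rows instead of the full dataset.
score = silhouette_score(
    X_arr, labels, metric="euclidean", sample_size=50, random_state=0
)
print(score)
```

The exact value depends on which rows are sampled, so it will vary with `random_state`.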

From the trace, you can see that the code does fancy indexing (subsampling) on the first axis. By default, indexing a data frame with [] selects columns, not rows, which is why you get this error.
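The difference between the two indexing behaviours can be seen on a small example. This is only an illustrative sketch: the array, the column names, and the variable names are made up, and the exact exception raised by the column lookup depends on the pandas version (older releases raise the IndexError shown in the question, newer ones typically a KeyError).

```python
import numpy as np
import pandas as pd

arr = np.arange(12).reshape(4, 3)  # 4 rows, 3 columns
df = pd.DataFrame(arr, columns=["a", "b", "c"])
indices = np.array([0, 2])

# On a numpy array, an integer index array selects rows.
rows_from_array = arr[indices]

# On a data frame, .iloc is needed to select rows by position.
rows_from_frame = df.iloc[indices]

# Plain [] on a data frame treats the indices as column selectors,
# so this lookup fails instead of returning rows 0 and 2.
try:
    df[indices]
    column_lookup_failed = False
except (KeyError, IndexError):
    column_lookup_failed = True

print(rows_from_array.shape, rows_from_frame.shape, column_lookup_failed)
```

This is why silhouette_score's internal `X[indices]` subsampling works on an array but fails when `X` is a data frame.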
