Failed to convert integer scalar error when using DBSCAN

I am trying to use scikit-learn's DBSCAN to cluster a collection of documents. First, I build a TF-IDF matrix using scikit-learn's TfidfVectorizer (this is a 163405x13029 sparse matrix of type numpy.float64). Then I try to cluster certain subsets of this matrix. Things work fine when a subset is small (say, up to a few thousand rows), but with large subsets (tens of thousands of rows) I get ValueError: could not convert integer scalar.
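
Roughly, the setup looks like this (a simplified sketch; docs stands for my document corpus and idxs for the row indices of one subset):

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import DBSCAN

    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(docs)  # sparse matrix, 163405x13029

    clusterizer = DBSCAN()
    labels = clusterizer.fit_predict(tfidf[idxs])  # fails once the subset gets large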

Here's the full traceback (idxs is a list of row indices):


    ValueError                                Traceback (most recent call last)
    <ipython-input-1-73ee366d8de5> in <module>()
        193 # use descriptions to clusterize items
        194 ncm_clusterizer = DBSCAN()
    --> 195 ncm_clusterizer.fit_predict(tfidf[idxs])
        196 idxs_clusters = list(zip(idxs, ncm_clusterizer.labels_))
        197 for e in idxs_clusters:

    /usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit_predict(self, X, y, sample_weight)
        294             cluster labels
        295         """
    --> 296         self.fit(X, sample_weight=sample_weight)
        297         return self.labels_

    /usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in fit(self, X, y, sample_weight)
        264         X = check_array(X, accept_sparse='csr')
        265         clust = dbscan(X, sample_weight=sample_weight,
    --> 266                        **self.get_params())
        267         self.core_sample_indices_, self.labels_ = clust
        268         if len(self.core_sample_indices_):

    /usr/local/lib/python3.4/site-packages/sklearn/cluster/dbscan_.py in dbscan(X, eps, min_samples, metric, algorithm, leaf_size, p, sample_weight, n_jobs)
        136         # This has worst case O(n^2) memory complexity
        137         neighborhoods = neighbors_model.radius_neighbors(X, eps,
    --> 138                                                          return_distance=False)
        139
        140     if sample_weight is None:

    /usr/local/lib/python3.4/site-packages/sklearn/neighbors/base.py in radius_neighbors(self, X, radius, return_distance)
        584         if self.effective_metric_ == 'euclidean':
        585             dist = pairwise_distances(X, self._fit_X, 'euclidean',
    --> 586                                       n_jobs=self.n_jobs, squared=True)
        587             radius *= radius
        588         else:

    /usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in pairwise_distances(X, Y, metric, n_jobs, **kwds)
       1238         func = partial(distance.cdist, metric=metric, **kwds)
       1239
    -> 1240     return _parallel_pairwise(X, Y, func, n_jobs, **kwds)
       1241
       1242

    /usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in _parallel_pairwise(X, Y, func, n_jobs, **kwds)
       1081     if n_jobs == 1:
       1082         # Special case to avoid picklability checks in delayed
    -> 1083         return func(X, Y, **kwds)
       1084
       1085     # TODO: in some cases, backend='threading' may be appropriate

    /usr/local/lib/python3.4/site-packages/sklearn/metrics/pairwise.py in euclidean_distances(X, Y, Y_norm_squared, squared, X_norm_squared)
        243         YY = row_norms(Y, squared=True)[np.newaxis, :]
        244
    --> 245     distances = safe_sparse_dot(X, Y.T, dense_output=True)
        246     distances *= -2
        247     distances += XX

    /usr/local/lib/python3.4/site-packages/sklearn/utils/extmath.py in safe_sparse_dot(a, b, dense_output)
        184         ret = a * b
        185         if dense_output and hasattr(ret, "toarray"):
    --> 186             ret = ret.toarray()
        187         return ret
        188     else:

    /usr/local/lib/python3.4/site-packages/scipy/sparse/compressed.py in toarray(self, order, out)
        918     def toarray(self, order=None, out=None):
        919         """See the docstring for `spmatrix.toarray`."""
    --> 920         return self.tocoo(copy=False).toarray(order=order, out=out)
        921
        922     ##############################################################

    /usr/local/lib/python3.4/site-packages/scipy/sparse/coo.py in toarray(self, order, out)
        256         M,N = self.shape
        257         coo_todense(M, N, self.nnz, self.row, self.col, self.data,
    --> 258                     B.ravel('A'), fortran)
        259         return B
        260

    ValueError: could not convert integer scalar

I am using Python 3.4.3 (on Red Hat), scipy 0.18.1 and scikit-learn 0.18.1.

I tried the monkey patch suggested here, but it didn't work.

Googling around, I found a bug fix that apparently solved the same problem for other sparse matrix types (e.g. csr), but not for coo.

As suggested here, I tried feeding DBSCAN a precomputed sparse radius-neighborhood graph instead of the feature matrix (sketched below), but I get the same error.
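
What I tried there, roughly (the eps value is only a placeholder):

    from sklearn.neighbors import radius_neighbors_graph
    from sklearn.cluster import DBSCAN

    eps = 0.5  # placeholder value
    # sparse graph holding distances to all neighbors within eps of each row
    graph = radius_neighbors_graph(tfidf[idxs], radius=eps, mode='distance')

    clusterizer = DBSCAN(eps=eps, metric='precomputed')
    labels = clusterizer.fit_predict(graph)  # same ValueError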

I also tried HDBSCAN, but it fails with the same error.

How can I fix or work around this?

1 answer

Even if the implementation allowed it, DBSCAN would likely give poor results on such high-dimensional data (statistically speaking, because of the curse of dimensionality).

Instead, I would recommend using TruncatedSVD to reduce your TF-IDF feature vectors to 50 or 100 components, and then applying DBSCAN to the result.
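
A rough sketch of that pipeline (the component count and the DBSCAN parameters are only starting points and will need tuning; tfidf and idxs are the matrix and subset from the question):

    from sklearn.decomposition import TruncatedSVD
    from sklearn.preprocessing import Normalizer
    from sklearn.pipeline import make_pipeline
    from sklearn.cluster import DBSCAN

    # LSA: project the sparse TF-IDF rows onto 100 dense components,
    # then re-normalize, since TruncatedSVD output is not unit-length
    lsa = make_pipeline(TruncatedSVD(n_components=100), Normalizer(copy=False))
    reduced = lsa.fit_transform(tfidf[idxs])  # dense array of shape (n_rows, 100)

    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(reduced)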

