DBSCAN error with cosine metric in Python

I tried to use the DBSCAN algorithm from the scikit-learn library with the cosine metric, but ran into an error. The line of code is:

db = DBSCAN(eps=1, min_samples=2, metric='cosine').fit(X) 

where X is a csr_matrix. The error is as follows:

The metric "cosine" is not valid for the "auto" algorithm,

although the documentation says this metric can be used. I also tried algorithm='kd_tree' and algorithm='ball_tree', but got the same error. However, there is no error if I use the euclidean metric or, say, l1.

The matrix X is large, so I cannot use a pre-computed matrix of pairwise distances.

I am using Python 2.7.6 and scikit-learn 0.16.1. My dataset does not contain any all-zero rows, so the cosine metric is well defined.

2 answers

The index structures in sklearn cannot accelerate cosine distance (this may change in newer versions).

Try algorithm='brute' .
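As a rough illustration of this suggestion, here is a minimal sketch on a tiny made-up sparse matrix. The data and the eps value are assumptions chosen only so that two clusters appear; they are not taken from the question.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

# Toy data (hypothetical): rows 0-1 point roughly along one axis,
# rows 2-3 roughly along another.
X = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 3.0, 0.2],
]))

# algorithm='brute' skips the tree index, so metric='cosine' is accepted.
# eps is a cosine distance (1 - cosine similarity); 0.1 is an arbitrary choice.
db = DBSCAN(eps=0.1, min_samples=2, metric='cosine', algorithm='brute').fit(X)
labels = db.labels_
print(labels)
```

Because cosine distance ignores vector length, rows 0 and 1 should end up in one cluster and rows 2 and 3 in another, even though their magnitudes differ.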

For a list of the metrics your version of sklearn can accelerate, see the supported metrics of the ball tree:

 from sklearn.neighbors import BallTree
 print(BallTree.valid_metrics)

If you want a normalized distance such as cosine distance, you can also normalize your vectors first and then use the Euclidean metric. Note that for two normalized vectors u and v, the Euclidean distance is sqrt(2 - 2 * cos(u, v)) (see Discussion).
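The identity follows from expanding |u - v|^2 = |u|^2 + |v|^2 - 2 u.v with |u| = |v| = 1. A quick numerical check, using two arbitrary example vectors (assumptions, chosen only for easy arithmetic), might look like this:

```python
import numpy as np

# Two arbitrary example vectors, scaled to unit length.
u = np.array([3.0, 4.0, 0.0])
v = np.array([1.0, 2.0, 2.0])
u = u / np.linalg.norm(u)
v = v / np.linalg.norm(v)

cos_uv = float(np.dot(u, v))                     # cosine similarity of unit vectors
euclid = float(np.linalg.norm(u - v))            # Euclidean distance
expected = float(np.sqrt(2.0 - 2.0 * cos_uv))    # value predicted by the identity

print(euclid, expected)
```

The two printed values should agree up to floating-point rounding.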

You can do something like:

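 import numpy as np
 from sklearn.cluster import DBSCAN
 # Scale each row to unit L2 norm (this assumes X is a dense NumPy array)
 Xnorm = np.linalg.norm(X, axis=1)
 Xnormed = np.divide(X, Xnorm.reshape(Xnorm.shape[0], 1))
 db = DBSCAN(eps=0.5, min_samples=2, metric='euclidean').fit(Xnormed)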

Distances will lie in [0, 2], so make sure you adjust your parameters accordingly.
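Since the X in the question is a csr_matrix, a sparse-friendly variant may be more convenient. The sketch below uses sklearn.preprocessing.normalize, which accepts sparse input, and converts a cosine-distance threshold into the matching Euclidean one via eps_euclid = sqrt(2 * eps_cos). The toy data and the threshold 0.1 are assumptions for illustration only.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import normalize

# Toy sparse data (hypothetical), with no all-zero rows.
X = csr_matrix(np.array([
    [1.0, 0.0, 0.0],
    [2.0, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 3.0, 0.2],
]))

# Scale every row to unit L2 norm; the result stays sparse.
Xnormed = normalize(X, norm='l2', axis=1)

# For unit vectors, euclidean = sqrt(2 * cosine_distance),
# so a cosine threshold maps to this Euclidean threshold.
eps_cos = 0.1  # arbitrary cosine-distance threshold
eps_euclid = np.sqrt(2.0 * eps_cos)

db = DBSCAN(eps=eps_euclid, min_samples=2, metric='euclidean').fit(Xnormed)
labels = db.labels_
print(labels)
```

With this conversion, the clustering should match what metric='cosine' with eps=eps_cos would produce on the original rows.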


Source: https://habr.com/ru/post/1232121/

