Cosine similarity when one of the vectors is all zeros

How should cosine similarity ( http://en.wikipedia.org/wiki/Cosine_similarity ) be defined when one of the vectors is all zeros?

v1 = [1, 1, 1, 1, 1]

v2 = [0, 0, 0, 0, 0]

Computing it with the classical formula leads to a division by zero:

Let d1 = 0 0 0 0 0 0

Let d2 = 1 1 1 1 1 1

Cosine Similarity (d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)

dot(d1, d2) = (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) = 0

||d1|| = sqrt((0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2) = 0

||d2|| = sqrt((1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2) = 2.44948974278

Cosine Similarity (d1, d2) = 0 / ((0) * (2.44948974278)) = 0 / 0

I want to use this similarity measure in a clustering application, and I will often have to compare such vectors, including [0, 0, 0, 0, 0] versus [0, 0, 0, 0, 0].

Do you have any experience with this? Since this is a similarity measure (not a distance), should I use special cases such as

d ([1, 1, 1, 1, 1]; [0, 0, 0, 0, 0]) = 0

d ([0, 0, 0, 0, 0]; [0, 0, 0, 0, 0]) = 1

What about

d ([1, 1, 1, 0, 0]; [0, 0, 0, 0, 0]) = ? and so on.

+5
2 answers

If you have zero vectors, cosine is the wrong similarity function for your application.

Cosine distance is essentially equivalent to squared Euclidean distance on L2-normalized data. That is, you normalize each vector to unit length and then compute the squared Euclidean distance. A zero vector cannot be normalized, which is exactly where the measure breaks down.
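To make the equivalence concrete, here is a minimal sketch in plain Python. The helper names (l2_normalize, cosine_similarity, squared_euclidean) and the sample vectors are illustrative assumptions, not part of the original answer. It checks the identity: squared Euclidean distance between the normalized vectors = 2 * (1 - cosine similarity).

```python
import math

def l2_normalize(v):
    """Scale v to unit Euclidean length (hypothetical helper)."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        # An all-zero vector has no direction, so it cannot be normalized.
        raise ValueError("cannot normalize an all-zero vector")
    return [x / norm for x in v]

def cosine_similarity(a, b):
    """Classical formula; divides by zero if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    """Sum of squared coordinate differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = [1.0, 1.0, 1.0, 0.0, 0.0]
b = [1.0, 1.0, 1.0, 1.0, 1.0]

cosine_distance = 1.0 - cosine_similarity(a, b)
sq_dist = squared_euclidean(l2_normalize(a), l2_normalize(b))

# The two quantities differ only by a constant factor of 2.
print(sq_dist, 2.0 * cosine_distance)
```

Calling l2_normalize on [0, 0, 0, 0, 0] raises an error in this sketch, which mirrors the 0 / 0 in the question.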

The other advantage of cosine is performance: computing it on very sparse, high-dimensional data is faster than Euclidean distance. It benefits from sparsity quadratically, not just linearly.

While you obviously can try to hack the similarity to be 0 when exactly one of the vectors is zero, and maximal when they are identical, it won't really solve the underlying problems.
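For reference, the hack in question could look like the sketch below. The function name is hypothetical, and the convention (1 for two zero vectors, 0 when exactly one is zero) is taken from the question; it is an arbitrary choice, not something the cosine formula implies.

```python
import math

def cosine_similarity_with_zero_rule(a, b):
    """Cosine similarity with an assumed convention for zero vectors."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 and norm_b == 0.0:
        # Assumption: two all-zero vectors count as identical.
        return 1.0
    if norm_a == 0.0 or norm_b == 0.0:
        # Assumption: a zero vector is dissimilar to every nonzero vector.
        return 0.0
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

zero = [0, 0, 0, 0, 0]
print(cosine_similarity_with_zero_rule([1, 1, 1, 1, 1], zero))
print(cosine_similarity_with_zero_rule(zero, zero))
print(cosine_similarity_with_zero_rule([1, 1, 1, 0, 0], zero))
```

Under this rule every nonzero vector gets the same similarity to the zero vector, so the values carry no information about how the vectors actually relate.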

Do not choose a distance based on what you can easily compute.

Instead, choose a distance such that the result has a meaning on your data. If the value is undefined, you don't have a meaning ...

Sometimes it may be fine to discard all-zero data as meaningless anyway (for example, when analyzing Twitter noise and encountering a tweet that is all numbers, with no words). Sometimes it isn't.

+8

It is undefined.

Suppose you have a nonzero vector C in place of your zero vector. Multiply it by epsilon > 0 and let epsilon go to zero. The cosine similarity does not change with epsilon, so the limit depends on C, and the function therefore cannot be extended continuously to the case where one of the vectors is zero.
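A small numeric sketch of this argument, in plain Python. The vectors v, c_aligned, and c_orthogonal are illustrative choices, not from the original answer. Shrinking C toward the zero vector leaves the similarity unchanged, and two different directions give two different limits.

```python
import math

def cosine_similarity(a, b):
    """Classical cosine similarity; both vectors must be nonzero."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1.0, 1.0, 1.0, 1.0, 1.0]
c_aligned = [1.0, 1.0, 1.0, 1.0, 1.0]      # same direction as v
c_orthogonal = [1.0, -1.0, 0.0, 0.0, 0.0]  # perpendicular to v

aligned_values = []
orthogonal_values = []
for eps in (1.0, 1e-3, 1e-9):
    # Both scaled vectors approach [0, 0, 0, 0, 0] as eps shrinks.
    aligned_values.append(cosine_similarity([eps * x for x in c_aligned], v))
    orthogonal_values.append(cosine_similarity([eps * x for x in c_orthogonal], v))

print(aligned_values)     # stays at (approximately) 1 for every eps
print(orthogonal_values)  # stays at 0 for every eps
```

Both sequences of vectors converge to the zero vector, yet one suggests a similarity of 1 and the other 0, so no single value at the zero vector is consistent with both.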

+1

Source: https://habr.com/ru/post/1206051/

