Cosine similarity when one of the vectors is all zeros

Question

How should cosine similarity ( http://en.wikipedia.org/wiki/Cosine_similarity ) be defined when one of the vectors is all zeros?

v1 = [1, 1, 1, 1, 1]

v2 = [0, 0, 0, 0, 0]

When we compute it with the classical formula, we get a division by zero:

Let d1 = 0 0 0 0 0 0
Let d2 = 1 1 1 1 1 1

Cosine Similarity (d1, d2) = dot(d1, d2) / (||d1|| * ||d2||)

dot(d1, d2) = (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) + (0)*(1) = 0
||d1|| = sqrt((0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2 + (0)^2) = 0
||d2|| = sqrt((1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2 + (1)^2) = 2.44948974278

Cosine Similarity (d1, d2) = 0 / ((0) * (2.44948974278)) = 0 / 0
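For illustration, the same computation can be sketched in plain Python. The function name `naive_cosine` is made up for this example; it applies the formula literally, with no handling for zero vectors, so the 0 / 0 shows up as an exception rather than a number.

```python
import math

def naive_cosine(a, b):
    """Cosine similarity straight from the formula, with no zero handling."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

d1 = [0, 0, 0, 0, 0, 0]
d2 = [1, 1, 1, 1, 1, 1]

try:
    result = naive_cosine(d1, d2)
except ZeroDivisionError:
    # Plain Python floats raise on 0 / 0; there is no value to return.
    result = None

print(result)
```

Array libraries may behave differently (for example by producing a NaN instead of raising), but the underlying value is equally undefined.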

I want to use this similarity measure in a clustering application, and I often have to compare such vectors, including [0, 0, 0, 0, 0] versus [0, 0, 0, 0, 0].

Do you have any experience with this? Since this is a similarity measure (not a distance), should I use a special case such as

d ([1, 1, 1, 1, 1]; [0, 0, 0, 0, 0]) = 0

d ([0, 0, 0, 0, 0]; [0, 0, 0, 0, 0]) = 1

What about

d ([1, 1, 1, 0, 0]; [0, 0, 0, 0, 0]) = ? and so on.

+5

machine-learning cluster-analysis data-mining cosine-similarity

Sebastian widz Nov 02 '14 at 13:13


2 answers

It is undefined.

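Suppose you put a nonzero vector C in place of your zero vector. Multiply it by epsilon > 0 and let epsilon go to zero. Cosine similarity ignores the length of a vector, so the result stays at the cosine between C and the other vector no matter how small epsilon gets. The limit therefore depends on C, and the function cannot be extended continuously to the case where one of the vectors is zero.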
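A small sketch of this argument in plain Python, assuming two arbitrarily chosen directions `c1` and `c2` (they are illustrative, not from the question). Both are scaled towards the zero vector, yet their similarity to `v` stays at two different values.

```python
import math

def cosine(a, b):
    """Cosine similarity for two nonzero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

v = [1, 1, 1, 1, 1]
c1 = [1, 1, 1, 1, 1]  # same direction as v
c2 = [1, 0, 0, 0, 0]  # a different direction

same_dir = []
other_dir = []
for eps in (1.0, 1e-3, 1e-6):
    # Shrink each direction towards the zero vector.
    same_dir.append(cosine([eps * x for x in c1], v))
    other_dir.append(cosine([eps * x for x in c2], v))

print(same_dir)   # stays close to 1.0 for every eps
print(other_dir)  # stays close to 1 / sqrt(5) for every eps
```

Because the two sequences approach the zero vector but keep different similarities, no single value at zero agrees with both.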

+1

Gyro gearloose Nov 02 '14 at 13:27


Anony-mousse · Accepted Answer · 2014-11-02T19:34:12+0000

If you have 0 vectors, cosine is the wrong similarity function for your application.

Cosine distance is essentially equivalent to squared Euclidean distance on L_2-normalized data. That is, you normalize every vector to unit length 1, then compute the squared Euclidean distance.
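This equivalence can be checked numerically. For unit vectors the squared Euclidean distance expands to 2 * (1 - cosine similarity). The sketch below uses two arbitrary example vectors and helper names chosen for illustration.

```python
import math

def normalize(v):
    """Scale v to unit L_2 length (v must be nonzero)."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

a = [1.0, 2.0, 0.0, 3.0]
b = [2.0, 0.0, 1.0, 1.0]

lhs = squared_euclidean(normalize(a), normalize(b))
rhs = 2.0 * (1.0 - cosine_similarity(a, b))

print(lhs, rhs)  # the two values agree up to rounding error
```

Note that `normalize` divides by the norm, so a zero vector cannot be normalized either; the problem from the question reappears in this formulation.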

The other benefit of cosine is performance: computing it on very sparse, high-dimensional data is faster than Euclidean distance. It benefits from sparsity quadratically, not just linearly, because only the positions where both vectors are nonzero contribute to the dot product.
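As a rough sketch of why sparsity helps, assume each vector is stored as a dict mapping index to nonzero value (this representation and the function names are assumptions made for the example). The dot product then only touches indices present in both dicts, and the norms can be cached per vector.

```python
import math

def sparse_dot(a, b):
    """Dot product of two sparse vectors stored as {index: value} dicts."""
    if len(a) > len(b):
        a, b = b, a  # iterate over the vector with fewer nonzeros
    return sum(value * b[index] for index, value in a.items() if index in b)

def sparse_norm(a):
    return math.sqrt(sum(value * value for value in a.values()))

def sparse_cosine(a, b):
    # Both vectors must have at least one nonzero entry.
    return sparse_dot(a, b) / (sparse_norm(a) * sparse_norm(b))

# Two vectors in a very high-dimensional space with two nonzeros each.
a = {3: 1.0, 1000: 2.0}
b = {3: 4.0, 50000: 1.0}

similarity = sparse_cosine(a, b)
print(similarity)
```

Only index 3 is shared here, so a single multiplication contributes to the dot product regardless of the nominal dimensionality.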

While you obviously can try to hack the similarity to be 0 when exactly one vector is zero, and maximal when they are identical, it won't really solve the underlying problems.
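For reference, such a hack might look like the sketch below. The function name and the chosen conventions (1 for two zero vectors, 0 when exactly one is zero) come from the special cases proposed in the question, not from any standard definition, and as noted above they only patch the symptom.

```python
import math

def cosine_with_zero_convention(a, b):
    """Cosine similarity with ad hoc values for zero vectors.

    Hypothetical convention: both zero -> 1.0, exactly one zero -> 0.0.
    """
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 and norm_b == 0.0:
        return 1.0  # treat two all-zero vectors as identical
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything else
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (norm_a * norm_b)

zero = [0, 0, 0, 0, 0]
print(cosine_with_zero_convention([1, 1, 1, 1, 1], zero))
print(cosine_with_zero_convention(zero, zero))
print(cosine_with_zero_convention([1, 1, 1, 0, 0], zero))
```

Under this convention every nonzero vector, whatever its direction, gets the same similarity 0 to the zero vector, which is exactly the kind of arbitrary choice the continuity argument warns about.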

Don't choose a distance by what you can easily compute.

Instead, choose a distance such that the result is meaningful for your data. If the value is undefined, you don't have a meaning ...

Sometimes it may work to discard constant-0 data as meaningless anyway (for example, when analyzing Twitter noise and seeing a tweet that is all numbers, no words). Sometimes it doesn't.
