Spark MLLib Word2Vec cosine similarity greater than 1

Question

http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec

In Spark's implementation of Word2Vec, when the number of iterations or the number of data partitions is greater than one, the cosine similarity is for some reason greater than 1.

As far as I know, cosine similarity should always lie between -1 and 1 (-1 ≤ cos ≤ 1). Does anyone know why?

machine-learning word2vec

Jason xie Oct 27 '15 at 4:54

1 answer

Kotaro Tanahashi · Answer 1 · 2015-11-17T18:33:33+0000

In the findSynonyms method, Word2Vec does not calculate the cosine similarity v1・vi / (|v1| |vi|); instead it calculates v1・vi / |vi|, where v1 is the vector of the query word and vi is the vector of a candidate word. Therefore, the value sometimes exceeds 1. To find the closest words, there is no need to divide by |v1|, because it is constant across all candidates and does not change their ranking.
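
To illustrate the point, here is a minimal sketch in plain Python (not Spark code). The vectors, the word labels, and the helper names `cosine` and `partial_score` are made up for this example; `partial_score` only mimics the v1・vi / |vi| score described above. It shows that the partial score can exceed 1 while the ranking stays the same as with true cosine similarity, and that dividing by |v1| recovers the cosine.

```python
import math

def dot(a, b):
    # Dot product of two equal-length vectors.
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    # Euclidean length of a vector.
    return math.sqrt(dot(a, a))

def cosine(query, cand):
    # True cosine similarity: v1.vi / (|v1| |vi|)
    return dot(query, cand) / (norm(query) * norm(cand))

def partial_score(query, cand):
    # Score as described in the answer: v1.vi / |vi| (no division by |v1|)
    return dot(query, cand) / norm(cand)

# Hypothetical query vector and candidate vectors (illustration only).
query = [3.0, 4.0]
candidates = {"a": [6.0, 8.0], "b": [4.0, 3.0], "c": [-1.0, 2.0]}

cos_scores = {w: cosine(query, v) for w, v in candidates.items()}
raw_scores = {w: partial_score(query, v) for w, v in candidates.items()}

# Rank candidates from most to least similar under each score.
rank_cos = sorted(candidates, key=cos_scores.get, reverse=True)
rank_raw = sorted(candidates, key=raw_scores.get, reverse=True)

# Dividing the partial score by |v1| gives back the cosine similarity.
recovered = {w: s / norm(query) for w, s in raw_scores.items()}

print(rank_cos, rank_raw)
print(raw_scores)
print(recovered)
```

If a value in [-1, 1] is needed, dividing each returned score by the norm of the query word's vector, as in `recovered`, is one way to get it under the assumption that the score is computed as described in the answer.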
