Spark MLLib Word2Vec cosine similarity greater than 1

http://spark.apache.org/docs/latest/mllib-feature-extraction.html#word2vec

In the Spark implementation of Word2Vec, when the number of iterations or the number of data partitions is greater than one, the reported cosine similarity is sometimes greater than 1.

As far as I know, cosine similarity should always lie in the range -1 ≤ cos ≤ 1. Does anyone know why it exceeds that here?

1 answer

In the findSynonyms method, Word2Vec does not calculate the cosine similarity v1·vi / (|v1| |vi|); instead it calculates v1·vi / |vi|, where v1 is the vector of the query word and vi is the vector of a candidate word. The result is the true cosine similarity multiplied by |v1|, so it can exceed 1 whenever |v1| is greater than 1. To rank the closest words there is no need to divide by |v1|, because it is a constant that is the same for every candidate.
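To make this concrete, here is a minimal sketch in plain Python. It does not call Spark. The vectors and the helper names (`partial_score`, `cosine`) are made up for illustration, and the scoring function only mirrors the formula described above, not Spark's actual code.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def norm(a):
    return math.sqrt(dot(a, a))

def cosine(a, b):
    # True cosine similarity: v1·vi / (|v1| |vi|)
    return dot(a, b) / (norm(a) * norm(b))

def partial_score(query, cand):
    # Score as described in the answer: v1·vi / |vi| (no division by |v1|)
    return dot(query, cand) / norm(cand)

# Hypothetical vectors; the query has norm 5, so scores are scaled by 5.
query = [3.0, 4.0]
candidates = {"a": [6.0, 8.0], "b": [1.0, 0.0], "c": [0.0, -2.0]}

scores = {w: partial_score(query, v) for w, v in candidates.items()}
cosines = {w: cosine(query, v) for w, v in candidates.items()}

# Sorting by either measure gives the same order of candidates.
rank_by_score = sorted(candidates, key=lambda w: scores[w], reverse=True)
rank_by_cosine = sorted(candidates, key=lambda w: cosines[w], reverse=True)

for w in rank_by_score:
    print(w, scores[w], cosines[w])
```

If you need a value in [-1, 1], dividing each returned score by the norm of the query vector should recover the true cosine similarity, assuming the score is computed as described above.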


Source: https://habr.com/ru/post/1613287/

