Word2vec: negative sampling (in layman's terms)?

I am reading the paper below, and I am having trouble understanding the concept of negative sampling.

http://arxiv.org/pdf/1402.3722v1.pdf

Can someone please help?

+48
machine-learning nlp word2vec
Jan 09 '15 at 12:31
2 answers

The idea of word2vec is to maximize the similarity (dot product) between the vectors of words that appear close to each other (in each other's context) in the text, and to minimize the similarity of words that do not. In equation (3) of the paper you are referring to, ignore the exponentiation for a moment. You have:

      v_c * v_w
  -------------------
  sum(v_c1 * v_w)

The numerator is basically the similarity between the context word c and the target word w. The denominator computes the similarity of all other contexts c1 with the target word w. Maximizing this ratio ensures that words that appear close together in the text end up with more similar vectors than words that do not. However, computing this can be very slow, because there are many contexts c1. Negative sampling is one way to address this problem - just select a few contexts c1 at random. The end result is that if cat appears in the context of food, then the vector of food becomes more similar to the vector of cat (as measured by their dot product) than to the vectors of several other randomly chosen words (e.g. democracy, greed, Freddy), rather than to the vectors of all other words in the language. This makes word2vec much faster to train.
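To make this concrete, here is a minimal NumPy sketch of the idea; the toy vocabulary, the vector size, and the helper name neg_sampling_score are made up for illustration and are not part of word2vec itself:

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy vocabulary and randomly initialized vectors (illustrative only).
    vocab = ["cat", "food", "democracy", "greed", "freddy"]
    dim = 8
    target_vecs = {w: rng.normal(size=dim) for w in vocab}   # v_w
    context_vecs = {w: rng.normal(size=dim) for w in vocab}  # v_c

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def neg_sampling_score(target, context, k=3):
        """Score one (target, context) pair against k random negative
        contexts instead of summing over the whole vocabulary."""
        v_w = target_vecs[target]
        # Positive term: pull the true context towards the target.
        score = np.log(sigmoid(context_vecs[context] @ v_w))
        # Negative terms: push k randomly drawn contexts away from the target.
        negatives = rng.choice([w for w in vocab if w != context], size=k, replace=False)
        for neg in negatives:
            score += np.log(sigmoid(-(context_vecs[neg] @ v_w)))
        return score  # training maximizes this quantity

    print(neg_sampling_score("cat", "food"))

Training then nudges the vectors so that this score goes up for observed (target, context) pairs, touching only k + 1 context vectors per update instead of the whole vocabulary.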

+93
Jan 09 '15 at 16:11

Computing the softmax (the function that determines which words are similar to the current target word) is expensive because it requires summing over all the words in V (the denominator), and V is usually very large.

What can be done?

Various strategies have been proposed to approximate the softmax. These approaches can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches are methods that keep the softmax layer intact but change its architecture to make it more efficient (for example, hierarchical softmax). Sampling-based approaches, on the other hand, do away with the softmax layer entirely and instead optimize some other loss function that approximates the softmax (they do this by approximating the normalization in the softmax denominator with some other loss that is cheap to compute, such as negative sampling).
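To see why the denominator is the bottleneck, here is a rough NumPy sketch of the two costs; the vocabulary size, dimensionality, and variable names are assumptions for illustration, not taken from any particular implementation:

    import numpy as np

    rng = np.random.default_rng(1)

    V, dim, k = 100_000, 300, 5              # assumed sizes, for illustration
    output_vecs = rng.normal(size=(V, dim))  # one output vector per word in V
    h = rng.normal(size=dim)                 # context (hidden) representation

    # Full softmax: the denominator touches every one of the V words.
    full_denominator = np.exp(output_vecs @ h).sum()        # O(V * dim) per step

    # Sampling-based idea: only score the true word plus k sampled words.
    true_word = 42
    sampled = rng.choice(V, size=k, replace=False)
    scores = output_vecs[np.append(sampled, true_word)] @ h  # O(k * dim) per step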

The loss function in Word2vec looks something like this:

  J(θ) = -(1/T) * sum_{t=1..T} log [ exp(v'_wt · h) / sum_{wi in V} exp(v'_wi · h) ]

(here h is the context representation, v'_w is the output vector of word w, V is the vocabulary, and T is the number of training positions)

Taking the logarithm, this decomposes into:

  J(θ) = -(1/T) * sum_{t=1..T} [ v'_wt · h - log sum_{wi in V} exp(v'_wi · h) ]

With some math and the gradient derivation (for more details see 2), this is converted to:

  J(θ) = -(1/T) * sum_{t=1..T} [ log σ(v'_wt · h) + sum_{i=1..k, wi ~ P_n} log σ(-v'_wi · h) ]

(σ is the sigmoid function, and the k words wi are drawn from a noise distribution P_n)

As you can see, this has been transformed into a binary classification problem. Since we need labels to perform the binary classification task, we label every correct word w together with its context c as true (y = 1, positive samples) - these are the words that actually appear in the window around the target word - and k words selected at random from the corpus as false (y = 0, negative samples).
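As an illustration of how such labeled (w, c, y) training pairs could be generated, here is a small sketch; the toy corpus, window size, and value of k are arbitrary choices for the example:

    import numpy as np

    rng = np.random.default_rng(2)

    corpus = "the cat sat on the mat while the dog ate food".split()
    vocab = sorted(set(corpus))
    window = 1   # context words to each side of the target
    k = 2        # negative samples per positive pair

    pairs = []   # (target w, context c, label y)
    for i, w in enumerate(corpus):
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i:
                continue
            pairs.append((w, corpus[j], 1))            # observed pair -> y = 1
            for neg in rng.choice(vocab, size=k):      # random words  -> y = 0
                pairs.append((w, str(neg), 0))

    print(pairs[:6])

A binary logistic classifier trained on these pairs (with the dot product v'_c · v_w as the logit) is exactly what the negative-sampling objective above optimizes.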

Link :

+10
Dec 25 '16 at 7:32


