Computing the softmax (an activation function used to determine which words are similar to the current target word) is expensive because it requires summing over all the words in the vocabulary V (the denominator), which is usually very large.
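To make the cost concrete, here is a minimal NumPy sketch (not part of the original answer; the vocabulary size, embedding dimension, and variable names are illustrative assumptions) showing that the denominator needs a score and an exponential for every word in V:

```python
import numpy as np

V, d = 50_000, 100                     # toy vocabulary size and embedding dimension
h = np.random.randn(d)                 # vector representing the current context
W_out = np.random.randn(V, d) * 0.01   # output embeddings, one row per vocabulary word

def full_softmax_prob(target_idx):
    # scores for *every* word in the vocabulary: |V| dot products
    scores = W_out @ h
    scores -= scores.max()             # shift for numerical stability
    exp_scores = np.exp(scores)
    # the denominator sums over all |V| words -- this is the expensive part
    return exp_scores[target_idx] / exp_scores.sum()

p = full_softmax_prob(target_idx=42)
```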
What can be done?
Various strategies have been proposed to approximate the softmax. They can be grouped into softmax-based and sampling-based approaches. Softmax-based approaches keep the softmax layer intact but change its architecture to make it more efficient (for example, hierarchical softmax). Sampling-based approaches, on the other hand, do away with the softmax layer entirely and instead optimize some other loss function that approximates it (they do this by approximating the normalization in the softmax denominator with some other loss that is cheap to compute, such as negative sampling).
The loss function in word2vec (the skip-gram model with a full softmax) looks something like this:

$$J_\theta = -\frac{1}{T}\sum_{t=1}^{T} \log p(w_t \mid c_t) = -\frac{1}{T}\sum_{t=1}^{T} \log \frac{\exp(h^\top v'_{w_t})}{\sum_{w_i \in V} \exp(h^\top v'_{w_i})}$$

where $h$ is the vector representing the context $c_t$ and $v'_{w_i}$ is the output embedding of word $w_i$.
Its logarithm decomposes into:

$$\log p(w_t \mid c_t) = h^\top v'_{w_t} - \log \sum_{w_i \in V} \exp(h^\top v'_{w_i})$$

The second term is what makes the computation expensive, since it sums over the entire vocabulary.
Using some mathematical and gradient manipulation (for more details see 2), it is converted into the negative-sampling objective:

$$\log \sigma(h^\top v'_{w_t}) + \sum_{i=1}^{k} \mathbb{E}_{w_i \sim P_n(w)}\left[\log \sigma(-h^\top v'_{w_i})\right]$$

where $\sigma$ is the sigmoid function and $P_n(w)$ is the noise distribution from which the $k$ negative words are drawn.
As you can see, the task is transformed into a binary classification problem. Since we need labels to perform this binary classification, we designate all correct words $w$, given their context $c$, as true (y = 1, positive samples): these are all the words in the window around the target word. The $k$ words selected at random from the corpus are designated as false (y = 0, negative samples).
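Below is a minimal NumPy sketch of this binary objective for a single (context, target) pair, under some simplifying assumptions: `h`, `W_out`, `target_idx`, and `k` are illustrative names not from the original answer, and the negatives are drawn uniformly here, whereas word2vec actually draws them from a unigram distribution raised to the 3/4 power:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(h, W_out, target_idx, k=5):
    V = W_out.shape[0]
    # positive sample: the true target word gets label y = 1
    pos_score = W_out[target_idx] @ h
    loss = -np.log(sigmoid(pos_score))
    # k negative samples: randomly chosen words get label y = 0
    neg_idx = np.random.randint(0, V, size=k)   # uniform here; word2vec uses unigram^(3/4)
    neg_scores = W_out[neg_idx] @ h
    loss += -np.log(sigmoid(-neg_scores)).sum()
    # only k + 1 dot products per update instead of |V| for the full softmax denominator
    return loss

# usage with the same toy sizes as above
V, d = 50_000, 100
h = np.random.randn(d)
W_out = np.random.randn(V, d) * 0.01
print(negative_sampling_loss(h, W_out, target_idx=42, k=5))
```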
Link :