Understanding the structure and output of the Skip-Gram Word2Vec model

My question is twofold, but hopefully not too complicated. Both parts relate specifically to the Skip-Gram model in Word2Vec:

  • The first part concerns the structure: as I understand it, the Skip-Gram model is based on one neural network with a single input weight matrix W, one hidden layer of size N, and C output weight matrices W', each of which is used to produce one of the C output vectors. Is that correct?

  • The second part concerns the output vectors: as I understand it, each output vector has size V and is the result of a softmax function. Each node of an output vector corresponds to the index of a word in the vocabulary, and the value of each node is the probability that the corresponding word occurs at that context position (for the given input word). However, the target output vectors are not one-hot encoded, even for training instances. Is that correct?

The way I picture it is something along the following lines (example):

Assume the vocabulary is ['quick', 'fox', 'jumped', 'lazy', 'dog'] and the context size C = 1. Then, for the input word 'jumped', the two output vectors might look like this:

[0.2 0.6 0.01 0.1 0.09]

[0.2 0.2 0.01 0.16 0.43]

I would interpret this as saying that 'fox' is the most likely word to appear before 'jumped' (p = 0.6), and that 'dog' is most likely to appear after it (p = 0.43).
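To make that reading concrete, here is a tiny numpy check of the two vectors above (the code and variable names are just my own illustration, not part of any word2vec library):

    import numpy as np

    vocab = ['quick', 'fox', 'jumped', 'lazy', 'dog']
    before = np.array([0.2, 0.6, 0.01, 0.1, 0.09])    # output vector for the position before 'jumped'
    after = np.array([0.2, 0.2, 0.01, 0.16, 0.43])    # output vector for the position after 'jumped'

    print(vocab[before.argmax()], before.max())       # fox 0.6
    print(vocab[after.argmax()], after.max())         # dog 0.43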

Do I have this right? Or am I completely off base? Any help is appreciated.

2 answers

This is my first answer on SO, so here goes.

Your understanding of both parts seems correct, according to this paper:

http://arxiv.org/abs/1411.2738

The paper describes word2vec in detail while staying very accessible; it is worth reading for a complete understanding of the neural network architecture used in word2vec.

  • The Skip-Gram structure uses a single neural network, with the one-hot encoded target word as input and the one-hot encoded context words as the expected output. After the network has been trained on a text corpus, the input weight matrix W holds the input vector representations of the words in the corpus, and the output weight matrix W', which is shared across all C outputs (the "output vectors" in the question's terminology, although I avoid that term here to prevent confusion with the output vector representations used next), holds the output vector representations of the words. Usually the output vector representations are ignored, and the input vector representations from W are used as the word embeddings. As for the dimensions of the matrices: if the vocabulary size is V and the hidden layer size is N, then W is a (V, N) matrix whose rows are the input vectors of the words at the corresponding vocabulary indices, and W' is an (N, V) matrix whose columns are the output vectors of those words. In this way we obtain N-dimensional vectors for words. (A short code sketch of this forward pass follows after this list.)
  • As you mentioned, each of the outputs (again avoiding the term "output vector") has size V and is the result of a softmax function, with each node giving the probability of that word occurring as a context word for the given target word; so the outputs are not one-hot encoded. But the expected outputs are indeed one-hot encoded: during the training phase, the error is calculated by subtracting the network's output from the one-hot encoded vector of the word that actually occurs at that context position, and then the weights are updated using gradient descent.
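As promised above, here is a minimal numpy sketch of that forward pass. The sizes V and N, the random initialization, and the chosen word index are illustrative assumptions, not values taken from the question:

    import numpy as np

    V, N = 5, 3                          # vocabulary size and hidden layer size (assumed)
    rng = np.random.default_rng(0)
    W = rng.normal(size=(V, N))          # input weights: row i is the input vector of word i
    W_prime = rng.normal(size=(N, V))    # output weights, shared by all C context positions

    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()

    target_index = 2                     # e.g. 'jumped' in the question's vocabulary
    x = np.zeros(V)
    x[target_index] = 1.0                # one-hot input for the target word
    h = x @ W                            # hidden layer = the target word's input vector, shape (N,)
    u = h @ W_prime                      # one score per vocabulary word, shape (V,)
    y = softmax(u)                       # probability of each word appearing in the context
    print(y)

Because W' is shared, the distribution y is the same at every one of the C context positions; during training the positions differ only in which actual context word is used as the one-hot target.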

Referring to your example with C = 1 and the vocabulary ['quick', 'fox', 'jumped', 'lazy', 'dog']:

If the output of the skip-gram network is [0.2 0.6 0.01 0.1 0.09] and the correct target word is "fox", then the error is calculated as

[0 1 0 0 0] - [0.2 0.6 0.01 0.1 0.09] = [-0.2 0.4 -0.01 -0.1 -0.09]

and the weight matrices are updated to minimize this error.
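Here is a small numpy sketch of that error calculation together with one gradient-descent step on W'. The hidden vector h, the zero-initialized W', and the learning rate are illustrative assumptions, not values from this answer:

    import numpy as np

    target = np.array([0, 1, 0, 0, 0], dtype=float)   # one-hot target: 'fox'
    output = np.array([0.2, 0.6, 0.01, 0.1, 0.09])    # softmax output of the network
    error = target - output
    print(error)                                      # approximately [-0.2, 0.4, -0.01, -0.1, -0.09]

    # One gradient-descent step on the output matrix W' for this context position;
    # with error defined as target - output, descending the cross-entropy gradient
    # means adding lr * outer(h, error).
    N, V = 3, 5
    h = np.full(N, 0.1)                               # hidden-layer vector of the input word (assumed)
    W_prime = np.zeros((N, V))                        # assumed initial output weights
    lr = 0.025                                        # assumed learning rate
    W_prime += lr * np.outer(h, error)                # pushes the predicted distribution toward 'fox'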

Hope this helps!


No. You can freely choose the length of the vectors.

Then what is a vector?

This is a distributed representation of the meaning of a word.

I won't explain here how it is trained, but, once trained, the vectors carry meaning, as shown below.

If one word's vector representation is like this,

[0.2 0.6 0.2]

then it is closer to [0.2 0.7 0.2] than to [0.7 0.2 0.5].

Here is another example.

CRY [0.5 0.7 0.2]

HAPPY [-0.4 0.3 0.1]

SAD [0.4 0.6 0.2]

"CRY" is closer to "SAD" than "HAPPY" because methods (CBOW or SKIP-GRAM, etc.) can make vectors closer when the meaning (or syntactic position) of words is similar.

In practice, the exact result depends on many things: the choice of method is important, and so is having a lot of good data (corpora).

If you want to check the similarity of some words, first build the word vectors and then compute the cosine similarity between those words.
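For instance, a minimal cosine-similarity check on the illustrative vectors above might look like this (plain numpy, not any word2vec library API):

    import numpy as np

    vectors = {
        'CRY':   np.array([0.5, 0.7, 0.2]),
        'HAPPY': np.array([-0.4, 0.3, 0.1]),
        'SAD':   np.array([0.4, 0.6, 0.2]),
    }

    def cosine(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

    print(cosine(vectors['CRY'], vectors['SAD']))     # about 0.999
    print(cosine(vectors['CRY'], vectors['HAPPY']))   # about 0.067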

The paper ( https://arxiv.org/pdf/1301.3781.pdf ) explains some of the methods and lists their accuracy.

If you can read C code, it is also useful to look at the word2vec program ( https://code.google.com/archive/p/word2vec/ ). It implements both CBOW (Continuous Bag-Of-Words) and skip-gram.

P.S. Please correct my bad English. P.S. Feel free to ask if you still have questions.

