What is the difference between an Embedding layer and a Dense layer?

The docs for the Keras Embedding layer say:

Turns positive integers (indices) into dense vectors of a fixed size. e.g. [[4], [20]] → [[0.25, 0.1], [0.6, -0.2]]

I believe the same thing could be achieved by one-hot encoding the input data as vectors of length vocabulary_size and feeding them into a Dense layer.

Is the Embedding layer just a convenience for this two-step process, or is something fancier going on under the hood?

+18
2 answers

Mathematically, the difference is as follows:

  • An Embedding layer performs a select (lookup) operation. In Keras, this layer is equivalent to:

     K.gather(self.embeddings, inputs) # just one matrix 
  • A Dense layer performs a dot-product operation, plus an optional bias and activation:

     outputs = matmul(inputs, self.kernel)   # a kernel matrix
     outputs = bias_add(outputs, self.bias)  # a bias vector
     return self.activation(outputs)         # an activation function

You can emulate an embedding layer with a fully connected layer via one-hot encoding, but the whole point of dense embeddings is to avoid the one-hot representation. In NLP, the vocabulary size can be on the order of 100k (sometimes even a million). On top of that, you often need to process sequences of words in a batch. Processing a batch of sequences of word indices is much more efficient than a batch of sequences of one-hot vectors. In addition, the gather operation itself is faster than a matrix dot product, both in the forward and the backward pass.
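Below is a minimal sketch of that equivalence (my own illustration, assuming TensorFlow 2.x and its bundled Keras): an Embedding lookup and a bias-free, linear Dense layer applied to one-hot vectors give the same output once they share the same weight matrix.

 import numpy as np
 import tensorflow as tf

 vocab_size, embed_dim = 3, 4
 w = np.array([[0.1, 0.2, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.8],
               [0.9, 0.0, 0.1, 0.2]], dtype=np.float32)

 indices = tf.constant([0, 2, 1, 2])              # a batch of word indices
 one_hot = tf.one_hot(indices, depth=vocab_size)  # the same words, one-hot encoded

 # Embedding: a lookup (gather) into the weight matrix.
 embedding = tf.keras.layers.Embedding(vocab_size, embed_dim)
 embedding.build((None,))
 embedding.set_weights([w])

 # Dense: a bias-free, linear layer holding the same weight matrix.
 dense = tf.keras.layers.Dense(embed_dim, use_bias=False, activation=None)
 dense.build((None, vocab_size))
 dense.set_weights([w])

 print(np.allclose(embedding(indices).numpy(), dense(one_hot).numpy()))  # True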

+25

An Embedding layer is faster because it is essentially the equivalent of a Dense layer that makes simplifying assumptions.

Imagine a word-embedding layer with these weights:

 w = [[0.1, 0.2, 0.3, 0.4],
      [0.5, 0.6, 0.7, 0.8],
      [0.9, 0.0, 0.1, 0.2]]

A Dense layer will treat these as actual weights to perform matrix multiplication with. An Embedding layer will simply treat these weights as a list of vectors, each of which represents a single word: the 0th word in the vocabulary is w[0], the 1st word is w[1], etc.


For example, take the weights above and this sentence:

 [0, 2, 1, 2] 

A naive Dense-based network needs to convert this sentence to a one-hot encoding

 [[1, 0, 0],
  [0, 0, 1],
  [0, 1, 0],
  [0, 0, 1]]

and then perform the matrix multiplication

 [[1*0.1 + 0*0.5 + 0*0.9,  1*0.2 + 0*0.6 + 0*0.0,  1*0.3 + 0*0.7 + 0*0.1,  1*0.4 + 0*0.8 + 0*0.2],
  [0*0.1 + 0*0.5 + 1*0.9,  0*0.2 + 0*0.6 + 1*0.0,  0*0.3 + 0*0.7 + 1*0.1,  0*0.4 + 0*0.8 + 1*0.2],
  [0*0.1 + 1*0.5 + 0*0.9,  0*0.2 + 1*0.6 + 0*0.0,  0*0.3 + 1*0.7 + 0*0.1,  0*0.4 + 1*0.8 + 0*0.2],
  [0*0.1 + 0*0.5 + 1*0.9,  0*0.2 + 0*0.6 + 1*0.0,  0*0.3 + 0*0.7 + 1*0.1,  0*0.4 + 0*0.8 + 1*0.2]]

=

 [[0.1, 0.2, 0.3, 0.4],
  [0.9, 0.0, 0.1, 0.2],
  [0.5, 0.6, 0.7, 0.8],
  [0.9, 0.0, 0.1, 0.2]]

The Embedding layer, however, just looks at [0, 2, 1, 2] and picks the layer's weights at indices zero, two, one, and two to immediately get

 [w[0], w[2], w[1], w[2]] 

=

 [[0.1, 0.2, 0.3, 0.4],
  [0.9, 0.0, 0.1, 0.2],
  [0.5, 0.6, 0.7, 0.8],
  [0.9, 0.0, 0.1, 0.2]]

So it is the same result, just obtained, hopefully, faster.
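The same check in plain NumPy (my own sketch, not part of the original answer): the one-hot matrix product and the row lookup produce identical results.

 import numpy as np

 w = np.array([[0.1, 0.2, 0.3, 0.4],
               [0.5, 0.6, 0.7, 0.8],
               [0.9, 0.0, 0.1, 0.2]])
 sentence = [0, 2, 1, 2]

 one_hot = np.eye(3)[sentence]   # shape (4, 3): the "naive Dense" input
 dense_result = one_hot @ w      # matrix multiplication, shape (4, 4)
 lookup_result = w[sentence]     # plain row lookup, shape (4, 4)

 print(np.allclose(dense_result, lookup_result))  # True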


The Embedding layer does have limitations:

  • The input must be integers in [0, vocab_length).
  • There is no bias.
  • No activation.

However, none of these restrictions should matter if you just want to convert an integer-encoded word into an embedding.
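A quick way to see those constraints for yourself (my own sketch, assuming TensorFlow/Keras) is to inspect the layer: its only weight is the embedding matrix, and it maps integer indices straight to rows of that matrix with no bias or activation applied.

 import tensorflow as tf

 emb = tf.keras.layers.Embedding(input_dim=3, output_dim=4)
 emb.build((None,))

 print([tuple(v.shape) for v in emb.weights])   # [(3, 4)] -- a single weight matrix, no bias
 print(emb(tf.constant([0, 2, 1, 2])).shape)    # (4, 4) -- raw row lookups, nothing else applied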

+2

Source: https://habr.com/ru/post/1274148/

