I am building a language model with Keras.
Basically, my vocabulary size N is ~30,000. I have already trained word2vec, so I use those embeddings, followed by an LSTM, and then I predict the next word with a fully connected layer followed by a softmax. My model is written below:
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense, Activation

EMBEDDING_DIM = 256

# pre-trained word2vec embeddings, frozen during training
embedding_layer = Embedding(N, EMBEDDING_DIM, weights=[embeddings],
                            trainable=False)

model = Sequential()
model.add(embedding_layer)
model.add(LSTM(EMBEDDING_DIM))       # returns only the last hidden state
model.add(Dense(N))                  # project to vocabulary size
model.add(Activation('softmax'))
model.compile(loss="categorical_crossentropy", optimizer="rmsprop")
I have two questions:
1) Can you confirm that in this case only the last hidden state of the LSTM is used (followed by the fully connected layer and softmax), and that there is no max/mean pooling over the hidden states across timesteps (as in, for example, this sentiment analysis tutorial: http://deeplearning.net/tutorial/lstm.html)? To be clear about what I mean by pooling, a sketch is below.
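This is just a hypothetical variant for comparison, not what my model above does (as far as I understand): the LSTM returns one hidden state per timestep and they are averaged before the output layer. GlobalAveragePooling1D is only one possible pooling choice here.

from keras.models import Sequential
from keras.layers import Embedding, LSTM, GlobalAveragePooling1D, Dense, Activation

# hypothetical pooled variant, for illustration only
pooled_model = Sequential()
pooled_model.add(Embedding(N, EMBEDDING_DIM, weights=[embeddings], trainable=False))
pooled_model.add(LSTM(EMBEDDING_DIM, return_sequences=True))  # one hidden state per timestep
pooled_model.add(GlobalAveragePooling1D())                    # mean pool over timesteps
pooled_model.add(Dense(N))
pooled_model.add(Activation('softmax'))
pooled_model.compile(loss="categorical_crossentropy", optimizer="rmsprop")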
2) Instead of projecting the LSTM output to N (30,000) classes, could the model output an EMBEDDING_DIM-sized vector and be trained with an mse loss, i.e. do a kind of "regression" toward the embedding of the next word, and then look up the closest word at prediction time? Does that make sense?
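To make the idea concrete, here is a rough sketch of what I have in mind (the targets would be embeddings[next_word_index] instead of one-hot vectors, and nearest_word is just a placeholder helper, not something from my current code):

import numpy as np
from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

# hypothetical "regression to the embedding" variant
reg_model = Sequential()
reg_model.add(Embedding(N, EMBEDDING_DIM, weights=[embeddings], trainable=False))
reg_model.add(LSTM(EMBEDDING_DIM))
reg_model.add(Dense(EMBEDDING_DIM))          # predict an embedding, not N classes
reg_model.compile(loss="mse", optimizer="rmsprop")

# at prediction time, pick the word whose embedding is closest to the output
def nearest_word(predicted_vec, embeddings):
    dists = np.linalg.norm(embeddings - predicted_vec, axis=1)
    return int(np.argmin(dists))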
Thanks!