Keras LSTM input dimensions with one-hot text embedding

I have 70 thousand text samples that I have tokenized with the Keras preprocessing utilities. This gives me an array like [40, 20, 142, ...] per sample, which I then pad to length 28 (the longest sample length). All I'm trying to do is predict a categorical label (say, from 0 to 5) from these values. When I train the model, I can't get above ~13% accuracy (my current suspicion is that I'm passing the input the wrong way; I've tried many variants).
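
For reference, my preprocessing looks roughly like this (a simplified sketch; texts is a placeholder name for my raw samples):

from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences

tokenizer = Tokenizer()
tokenizer.fit_on_texts(texts)                    # texts: list of 70k raw strings
sequences = tokenizer.texts_to_sequences(texts)  # -> integer lists like [40, 20, 142, ...]
X = pad_sequences(sequences, maxlen=28)          # pad everything to the longest sample (28)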

This is my data at the moment, and I'm just trying to create a simple LSTM. Again, my data is X → [28 integer values per sample, to be embedded] and Y → [one 3-digit integer per sample (100, 143, etc.)]. Any ideas what I'm doing wrong? I asked many people and no one could help. Here is the code for my model... any ideas? :(

from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.optimizers import RMSprop

optimizer = RMSprop(lr=0.01) #saw this online, no idea
model = Sequential()
model.add(Embedding(input_dim=28,output_dim=1,init='uniform')) #28 features, 1 dim output?
model.add(LSTM(150)) #just adding my LSTM nodes
model.add(Dense(1)) #since I want my output to be 1 integer value

model.compile(loss='sparse_categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
print(model.summary())

Edit:

Using model.add(Embedding(input_dim=900, output_dim=8, init='uniform')) seems to work; however, accuracy still never improves. I don't know what to do.
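
I picked input_dim=900 by trial and error; presumably it has to be at least the vocabulary size, which (assuming the tokenizer from my preprocessing sketch above) could be computed like this:

vocab_size = len(tokenizer.word_index) + 1  # +1 because Keras word indices start at 1
model.add(Embedding(input_dim=vocab_size, output_dim=8, init='uniform'))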

2 answers

I have two suggestions.

  • One-hot encode your labels (y). Y should be a categorical (one-hot) matrix rather than a single integer per sample; see the sketch after this list.
  • Consider pre-trained word embeddings such as word2vec for your input, instead of training the Embedding layer from scratch.

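For the first point, a minimal sketch of one-hot encoding the labels (assuming y_train and y_test hold integer class ids in 0..nb_classes-1):

from keras.utils import np_utils

# turn integer class ids into one-hot rows, e.g. 2 -> [0, 0, 1, 0, 0, 0, 0, 0]
y_train = np_utils.to_categorical(y_train, nb_classes)
y_test = np_utils.to_categorical(y_test, nb_classes)
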
from keras.models import Sequential
from keras.layers import Dense, LSTM, Embedding
from keras.optimizers import RMSprop

optimizer = RMSprop(lr=0.01)
embedding_vector_length = 32
max_review_length = 28
nb_classes = 8
model = Sequential()
model.add(Embedding(input_dim=900, output_dim=embedding_vector_length, input_length=max_review_length)) #input_length is the padded sequence length, not the number of features
model.add(LSTM(150))
model.add(Dense(output_dim=nb_classes, activation='softmax')) #one unit per class; softmax pairs with categorical_crossentropy

model.compile(loss='categorical_crossentropy', optimizer=optimizer, metrics=['accuracy'])
print(model.summary())

model.fit(X_train, y_train, nb_epoch=3, batch_size=64)

# Final evaluation of the model
scores = model.evaluate(X_test, y_test, verbose=0)
print("Accuracy: %.2f%%" % (scores[1]*100))

Try tokenizing your text with keras.preprocessing.text.

For example:

from keras.preprocessing.text import Tokenizer

tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)                         # data: list of raw text samples
tokenized_data = tokenizer.texts_to_sequences(data)  # one integer sequence per sample
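
The resulting sequences can then be padded to the fixed length the LSTM input expects; a minimal sketch, with maxlen=28 matching the longest sample from the question:

from keras.preprocessing.sequence import pad_sequences

padded_data = pad_sequences(tokenized_data, maxlen=28)  # pad/truncate every sequence to 28 integers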

Keras documentation: https://keras.io/preprocessing/text/

A full example using Tokenizer: https://github.com/sreeram004/Machine-Learning/blob/master/Youtube%20Spam%20Classification%20using%20CNN%20LSTM/youtube-spam-classification-cnn_lstm.py
