I am trying to build an image captioning model in Keras.
```python
# Keras 1.x; createVGG16() is my helper that builds a Sequential VGG16
from keras.models import Sequential
from keras.layers import (Activation, Dense, Embedding, LSTM,
                          Merge, RepeatVector, TimeDistributed)

def create_model(self, ret_model=False):  # method of my captioning class
    # Vision module: frozen VGG16 with the top two classification layers removed
    modelV = createVGG16()
    modelV.trainable = False
    # Note: popping layers of a Sequential model does not rebuild its
    # output tensors in every Keras version
    modelV.layers.pop()
    modelV.layers.pop()
    print('LOADED VISION MODULE')

    # Language module: embed the partial caption and encode it with an LSTM
    modelL = Sequential()
    modelL.add(Embedding(self.vocab_size, 256, input_length=self.max_cap_len))
    modelL.add(LSTM(128, return_sequences=True))
    modelL.add(TimeDistributed(Dense(128)))
    print('LOADED LANGUAGE MODULE')

    # Repeat the image feature vector once per caption time step
    modelV.add(RepeatVector(self.max_cap_len))
    print('LOADED REPEAT MODULE')

    # Merge both streams per time step and predict the next word
    model = Sequential()
    model.add(Merge([modelV, modelL], mode='concat', concat_axis=-1))
    model.add(LSTM(256, return_sequences=False))
    model.add(Dense(self.vocab_size))
    model.add(Activation('softmax'))

    if ret_model:
        return model

    model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    print('COMBINED MODULES')
    return model
```
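Since `Merge` is removed in Keras 2 and `layers.pop()` on a `Sequential` model doesn't always rebuild the output tensors, the same wiring can also be expressed with the functional API. This is only a sketch of an equivalent, not the code I ran: it assumes the image has already been run through VGG16 and its 4096-d fc2 feature vector is fed in directly (the `feat_dim` value, the `build_caption_model` name, and the extra `Dense(128)` projection are my assumptions):

```python
from keras.models import Model
from keras.layers import (Input, Dense, Embedding, LSTM,
                          TimeDistributed, RepeatVector, concatenate)

def build_caption_model(vocab_size, max_cap_len, feat_dim=4096):
    # Image branch: project the precomputed VGG16 feature vector and
    # repeat it once per caption time step
    img_in = Input(shape=(feat_dim,))
    img_feats = Dense(128, activation='relu')(img_in)
    img_seq = RepeatVector(max_cap_len)(img_feats)

    # Text branch: embed the partial caption and encode it with an LSTM
    cap_in = Input(shape=(max_cap_len,))
    cap_emb = Embedding(vocab_size, 256, input_length=max_cap_len)(cap_in)
    cap_seq = LSTM(128, return_sequences=True)(cap_emb)
    cap_seq = TimeDistributed(Dense(128))(cap_seq)

    # Fuse both streams per time step and predict the next word
    merged = concatenate([img_seq, cap_seq], axis=-1)
    encoded = LSTM(256, return_sequences=False)(merged)
    out = Dense(vocab_size, activation='softmax')(encoded)

    model = Model(inputs=[img_in, cap_in], outputs=out)
    model.compile(loss='categorical_crossentropy', optimizer='rmsprop',
                  metrics=['accuracy'])
    return model
```

Precomputing the VGG16 features once per image also avoids running the frozen CNN on every training step.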
I trained this model on all 5 captions of the first 100 images of the Flickr8k test split for 50 epochs; each caption is paired with its image and the pairs are merged into one training set. To generate a caption, I feed in the image with the start token as the initial word. At each step I predict a probability distribution over the vocabulary and take the most likely word as the next word. At the next step I feed that predicted word back in as input and generate a new distribution (a sketch of this loop follows).
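For concreteness, a minimal sketch of that greedy decoding loop (the `<start>`/`<end>` token names and the `word_index`/`index_word` lookup dicts are assumptions, not from my code):

```python
import numpy as np
from keras.preprocessing.sequence import pad_sequences

def generate_caption(model, image_feats, word_index, index_word, max_cap_len):
    # Greedy decoding: start from the start token and repeatedly
    # append the most probable next word
    words = ['<start>']
    for _ in range(max_cap_len):
        # Encode the partial caption so far and pad it to max_cap_len
        seq = [word_index[w] for w in words if w in word_index]
        seq = pad_sequences([seq], maxlen=max_cap_len)
        # image_feats is a batch of size 1 for the image branch;
        # probs is a vocab-sized distribution over the next word
        probs = model.predict([image_feats, seq])[0]
        next_word = index_word[np.argmax(probs)]
        words.append(next_word)
        if next_word == '<end>':
            break
    return ' '.join(words[1:])
```

Note that because the `Embedding` branch expects a full `max_cap_len` sequence, this loop feeds the entire padded partial caption so far, not just the single word predicted last.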
What happens is that I get the same probability distribution at every time step.
My questions are:
- Is my model too small to generate captions?
- Is the training data too small?
- Is the number of epochs too small?
- Something else I'm missing?