I think a lot of your confusion stems from the Keras documentation, which is a bit unclear:
> return_sequences: Boolean. Whether to return the last output in the output sequence, or the full sequence.
>
> return_state: Boolean. Whether to return the last state in addition to the output.
The docs on return_state are particularly confusing because they imply that hidden states are different from outputs, but they are one and the same. For LSTM this gets a little muddier, because in addition to the hidden (output) state there is also a cell state. We can confirm this by looking at the LSTM step function in the Keras source:
```python
class LSTM(Recurrent):
    def step(...):
        ...
        return h, [h, c]
```
The return signature of this function is output, states. So we can see that the hidden state h is in fact the output, and for the states we get both the hidden state h and the cell state c. That's why the Wikipedia article you linked uses the terms "hidden" and "output" interchangeably.
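If you want to see this behavior directly, here's a minimal sketch using the Keras functional API (the input shape and unit count are arbitrary values picked just for illustration): with return_sequences=True and return_state=True, the LSTM returns the full sequence of hidden states plus the final hidden and cell states, and the final hidden state is simply the last entry of that sequence.

```python
import numpy as np
from keras.layers import Input, LSTM
from keras.models import Model

# 10 timesteps of 8 features each; 4 LSTM units (arbitrary sizes for illustration)
inputs = Input(shape=(10, 8))
seq, state_h, state_c = LSTM(4, return_sequences=True, return_state=True)(inputs)
model = Model(inputs, [seq, state_h, state_c])

seq_out, h, c = model.predict(np.random.rand(1, 10, 8))
print(seq_out.shape)                      # (1, 10, 4): hidden state (= output) at every timestep
print(h.shape, c.shape)                   # (1, 4) (1, 4): final hidden state and final cell state
print(np.allclose(seq_out[:, -1, :], h))  # True: the final hidden state is just the last output
```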
Reading the paper you linked a little more closely, it seems to me that your initial implementation is what you want:
```python
my_lstm = LSTM(128, input_shape=(a, b), return_sequences=True)
my_lstm = AttentionWithContext()(my_lstm)
out = Dense(2, activation='softmax')(my_lstm)
```
This will pass the hidden state at every timestep to your attention layer. The only scenario where you're out of luck is the one where you actually want to pass the cell state from each timestep to your attention layer (which is what I thought initially), but I don't think that's what you want. The paper you linked actually uses a GRU layer, which has no concept of a cell state, and whose step function likewise returns the hidden state as its output:
```python
class GRU(Recurrent):
    def step(...):
        ...
        return h, [h]
```
So the paper is almost certainly referring to hidden states (a.k.a. outputs), not cell states.
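The same sanity check works for the GRU (again just a sketch, with the same arbitrary sizes as above): asking for the state gives you back a single tensor, and it's identical to the last output in the sequence.

```python
import numpy as np
from keras.layers import Input, GRU
from keras.models import Model

inputs = Input(shape=(10, 8))
seq, state_h = GRU(4, return_sequences=True, return_state=True)(inputs)  # GRU: one state tensor only
model = Model(inputs, [seq, state_h])

seq_out, h = model.predict(np.random.rand(1, 10, 8))
print(np.allclose(seq_out[:, -1, :], h))  # True: the GRU's "state" is just its last hidden output
```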