That's a good question. You should certainly look at the suggested posts for the details, but a complete example will be useful here.
RNN backpropagation
I think it makes sense to describe an ordinary RNN first (because the LSTM diagram is particularly confusing) and understand its backpropagation.
When it comes to backpropagation, the key idea is to unroll the network, which is a way of converting the recursion in an RNN into a feed-forward sequence of computations (like in the picture above). Note that the abstract RNN is infinite (it can be arbitrarily long), but each particular implementation is limited because memory is limited. As a result, the unrolled network really is a long feed-forward network with only a few complications, e.g. the weights are shared across the different layers.
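To make the unrolling concrete, here is a small illustrative sketch (the names step, W and xs are made up for this example, not taken from the post): the recursion is replaced by a plain loop, i.e. a stack of identical layers that all reuse the same weight matrix.

    import numpy as np

    # one RNN cell: the new hidden state is a function of the previous state and the current input
    def step(W, h, x):
        return np.tanh(W.dot(np.concatenate([h, x])))

    hidden_size, input_size, T = 4, 3, 5
    W = np.random.randn(hidden_size, hidden_size + input_size) * 0.01
    xs = [np.random.randn(input_size) for _ in range(T)]

    h = np.zeros(hidden_size)
    for x in xs:              # the unrolled network: T "layers", each one applies the same W
        h = step(W, h, x)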
Let's look at the classic char-rnn example by Andrej Karpathy. Here each RNN cell produces two outputs: h[t] (the hidden state that is fed into the next cell) and y[t] (the output at this step), according to the following formulas, where Wxh, Whh and Why are the shared parameters:

    h[t] = tanh(Wxh * x[t] + Whh * h[t-1] + bh)
    y[t] = Why * h[t] + by

In the code, these are just three matrices and two bias vectors:
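If I remember the script correctly, they are defined roughly like this in min-char-rnn.py (hidden_size and vocab_size are set earlier in the script):

    # model parameters
    Wxh = np.random.randn(hidden_size, vocab_size) * 0.01   # input to hidden
    Whh = np.random.randn(hidden_size, hidden_size) * 0.01  # hidden to hidden
    Why = np.random.randn(vocab_size, hidden_size) * 0.01   # hidden to output
    bh = np.zeros((hidden_size, 1))                         # hidden bias
    by = np.zeros((vocab_size, 1))                          # output bias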
The forward pass is quite straightforward; this example uses softmax with cross-entropy loss. Note that each iteration uses the same W* and h* arrays, but the output and the hidden state are different:
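For reference, the forward loop in min-char-rnn.py looks roughly like this (loss is initialized to 0 and hs[-1] to the previous hidden state before the loop; xs, hs, ys, ps are dicts indexed by the step t):

    for t in xrange(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1))   # encode the input character in 1-of-k representation
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh)  # hidden state
        ys[t] = np.dot(Why, hs[t]) + by     # unnormalized log-probabilities for the next character
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t]))                    # softmax probabilities
        loss += -np.log(ps[t][targets[t], 0])                            # cross-entropy loss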
Now the backward pass is performed exactly as if it were a feed-forward network, but the gradients of the W* and h* arrays accumulate the gradients of all the cells:
    for t in reversed(xrange(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1                     # gradient of softmax + cross-entropy
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext         # backprop into h, plus the gradient from the next cell
        dhraw = (1 - hs[t] * hs[t]) * dh        # backprop through the tanh nonlinearity
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t-1].T)
        dhnext = np.dot(Whh.T, dhraw)           # gradient passed back to the previous cell
Both passes above are run in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it larger to capture longer dependencies in the input, but you pay for it by having to store all the outputs and gradients of every cell.
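In min-char-rnn.py the chunks are formed along these lines (seq_length is the unroll length, p is a moving pointer into the training text):

    # slice out a chunk of seq_length characters; targets are the inputs shifted by one
    inputs = [char_to_ix[ch] for ch in data[p:p + seq_length]]
    targets = [char_to_ix[ch] for ch in data[p + 1:p + seq_length + 1]]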
What sets LSTM apart
The picture of an LSTM and its formulas look intimidating, but once you have coded a plain vanilla RNN, the implementation of an LSTM is pretty much the same. For example, take the backward pass.
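Here is a minimal sketch of the backward pass through a single LSTM cell, assuming the common formulation where the four gate pre-activations share one pair of matrices Wx and Wh (the function name, the cache layout and these variable names are illustrative, not taken from any particular implementation):

    import numpy as np

    def lstm_cell_backward(dh, dc_next, cache):
        # dh: gradient w.r.t. this cell's hidden state h, shape (N, H)
        # dc_next: gradient w.r.t. the cell state, flowing back from the next step, shape (N, H)
        # cache holds what the forward pass saved for this cell, where
        #   a = x.dot(Wx) + h_prev.dot(Wh) + b was split into gates i, f, o (sigmoid) and g (tanh),
        #   c = f * c_prev + i * g  and  h = o * tanh(c)
        x, h_prev, c_prev, i, f, o, g, c, Wx, Wh = cache

        tanh_c = np.tanh(c)
        do = dh * tanh_c                           # output gate
        dc = dc_next + dh * o * (1 - tanh_c ** 2)  # cell state
        df = dc * c_prev                           # forget gate
        di = dc * g                                # input gate
        dg = dc * i                                # candidate values
        dc_prev = dc * f                           # flows back to the previous cell

        # backprop through the gate nonlinearities into the stacked pre-activations, shape (N, 4H)
        da = np.hstack([di * i * (1 - i),
                        df * f * (1 - f),
                        do * o * (1 - o),
                        dg * (1 - g ** 2)])

        # exactly the same pattern as in the vanilla RNN above
        dWx = x.T.dot(da)
        dWh = h_prev.T.dot(da)
        db = da.sum(axis=0)
        dx = da.dot(Wx.T)
        dh_prev = da.dot(Wh.T)                     # flows back to the previous cell
        return dx, dh_prev, dc_prev, dWx, dWh, db

The only genuinely new pieces compared to the vanilla RNN cell are the second state c and the per-gate bookkeeping; the parameter gradients are still accumulated across cells in exactly the same way.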
Summary
Now back to your questions.
My question is how LSTM backpropagation differs from regular neural networks
It's the shared weights across the layers, plus a few additional variables (the states) that you need to pay attention to. Other than that, there is no difference at all.
Do you use the first error (calculated as hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?
First, the loss function is not necessarily L2. In the example above it is the cross-entropy loss, so the initial error signal is its gradient.
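In the char-rnn code above, that gradient is obtained by subtracting 1 at the target index of the softmax output (the same two lines that open the backward loop earlier):

    dy = np.copy(ps[t])    # ps[t] is the softmax output from the forward pass
    dy[targets[t]] -= 1    # gradient of the cross-entropy loss w.r.t. the scores y[t]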
Note that it is the same error signal as in an ordinary feed-forward neural network. If you use the L2 loss, the signal does indeed equal the ground truth minus the actual output.
In the case of LSTM it is slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient from the loss function, which means that the error signal of every cell gets accumulated. But once again, if you unroll the LSTM, you will see a direct correspondence with the network wiring.
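To make that accumulation concrete, here is a sketch of the backward loop over the unrolled LSTM, reusing the hypothetical lstm_cell_backward from above (N, T and H stand for the batch size, number of steps and hidden size; none of these names come from the post):

    # assumes a forward pass has produced the parameters Wx, Wh, b, a list `caches` of per-step
    # caches, and d_h of shape (N, T, H): the upstream loss gradient for every step
    dWx, dWh, db = np.zeros_like(Wx), np.zeros_like(Wh), np.zeros_like(b)
    d_h_next_t = np.zeros((N, H))   # nothing flows in from beyond the last step
    d_c_next_t = np.zeros((N, H))
    for t in reversed(range(T)):
        # upstream gradient for step t plus the gradient carried back from step t+1
        d_next_h = d_h_next_t + d_h[:, t, :]
        dx_t, d_h_next_t, d_c_next_t, dWx_t, dWh_t, db_t = \
            lstm_cell_backward(d_next_h, d_c_next_t, caches[t])
        dWx += dWx_t    # shared-parameter gradients accumulate over all cells,
        dWh += dWh_t    # just like in the vanilla RNN
        db += db_t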