LSTM RNN Backpropagation

Can someone give a clear explanation of backpropagation for an LSTM RNN? This is the type of structure I am working with. My question is not about what backpropagation is; I understand it is a reverse-order method of computing the error between the hypothesis and the output, which is then used to adjust the weights of a neural network. My question is how LSTM backpropagation differs from that in regular neural networks.

[image: LSTM network diagram]

I am not sure how to find the initial error for each of the gates. Do you use the same error (calculated as hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation? I am not sure how the cell state plays a role in LSTM backprop, if it does at all. I have been looking thoroughly for a good source on LSTMs but have not found one yet.

+6
2 answers

That's a good question. Of course, you should take a look at the suggested posts for details, but a complete example will be useful here.

RNN backpropagation

I think it makes sense to talk about an ordinary RNN first (because the LSTM diagram is particularly confusing) and understand its backpropagation.

When it comes to backpropagation, the key idea is network unrolling, which is a way of transforming the recursion in the RNN into a feed-forward sequence of computations (like in the picture above). Note that the abstract RNN is unbounded (it can be arbitrarily long), but each particular implementation is limited because memory is limited. As a result, the unrolled network really is a long feed-forward network, with just a few complications, e.g. the weights in different layers are shared.
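To make the unrolling concrete, here is a minimal sketch (the function and variable names are mine, purely for illustration): running the recurrence for several steps is literally the same cell applied over and over with shared weights, i.e. an ordinary feed-forward chain whose length is fixed by how many steps you keep in memory.

    import numpy as np

    def rnn_cell(x_t, h_prev, Wxh, Whh, bh):
        # one step of the recurrence; the same weights are reused at every step
        return np.tanh(np.dot(Wxh, x_t) + np.dot(Whh, h_prev) + bh)

    def unrolled_forward(xs, h0, Wxh, Whh, bh):
        # the "unrolled" network: a plain feed-forward chain of identical layers,
        # one layer per input step, all sharing Wxh, Whh and bh
        h, hs = h0, []
        for x_t in xs:
            h = rnn_cell(x_t, h, Wxh, Whh, bh)
            hs.append(h)
        return hs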

Let's look at a classic char-rnn example from Andrej Karpathy. Here, each RNN cell produces two outputs: h[t] (the state that is fed into the next cell) and y[t] (the output at this step), according to the formulas below, where Wxh, Whh and Why are the shared parameters:

    h[t] = tanh(Wxh · x[t] + Whh · h[t-1] + bh)
    y[t] = Why · h[t] + by

In the code, these are just three matrices and two bias vectors:

    # model parameters
    Wxh = np.random.randn(hidden_size, vocab_size)*0.01 # input to hidden
    Whh = np.random.randn(hidden_size, hidden_size)*0.01 # hidden to hidden
    Why = np.random.randn(vocab_size, hidden_size)*0.01 # hidden to output
    bh = np.zeros((hidden_size, 1)) # hidden bias
    by = np.zeros((vocab_size, 1)) # output bias

The forward pass is quite simple; this example uses softmax and cross-entropy loss. Note that every iteration uses the same W* weights, while the hidden state and the output are different at each step:

    # forward pass
    for t in xrange(len(inputs)):
        xs[t] = np.zeros((vocab_size, 1)) # encode in 1-of-k representation
        xs[t][inputs[t]] = 1
        hs[t] = np.tanh(np.dot(Wxh, xs[t]) + np.dot(Whh, hs[t-1]) + bh) # hidden state
        ys[t] = np.dot(Why, hs[t]) + by # unnormalized log probabilities for next chars
        ps[t] = np.exp(ys[t]) / np.sum(np.exp(ys[t])) # probabilities for next chars
        loss += -np.log(ps[t][targets[t], 0]) # softmax (cross-entropy loss)

Now the backward pass is performed exactly as if it were a feed-forward network, but the gradients of the W* arrays accumulate the gradients from all cells:

    for t in reversed(xrange(len(inputs))):
        dy = np.copy(ps[t])
        dy[targets[t]] -= 1
        dWhy += np.dot(dy, hs[t].T)
        dby += dy
        dh = np.dot(Why.T, dy) + dhnext # backprop into h
        dhraw = (1 - hs[t] * hs[t]) * dh # backprop through tanh nonlinearity
        dbh += dhraw
        dWxh += np.dot(dhraw, xs[t].T)
        dWhh += np.dot(dhraw, hs[t-1].T)
        dhnext = np.dot(Whh.T, dhraw)

Both passes above are run in chunks of size len(inputs), which corresponds to the size of the unrolled RNN. You might want to make it larger to capture longer dependencies in the input, but you pay for it by storing all the outputs and gradients of every cell.
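In practice this chunking is what is usually called truncated backpropagation through time. Below is a minimal sketch of such a training loop; forward_backward is a hypothetical stub standing in for the two passes above, and all names and sizes are illustrative, not taken from Karpathy's script:

    import numpy as np

    hidden_size, seq_length = 100, 25                  # illustrative sizes

    def forward_backward(inputs, targets, hprev):
        # stand-in for the forward and backward loops shown above;
        # returns the loss, the accumulated gradients and the last hidden state
        return 0.0, {}, np.zeros_like(hprev)

    data = list(range(10000))                          # stand-in for the encoded corpus
    hprev = np.zeros((hidden_size, 1))                 # hidden state carried across chunks

    for p in range(0, len(data) - seq_length - 1, seq_length):
        inputs = data[p:p + seq_length]
        targets = data[p + 1:p + seq_length + 1]       # next-step prediction targets
        loss, grads, hprev = forward_backward(inputs, targets, hprev)
        # gradients are truncated at the chunk boundary: a larger seq_length
        # captures longer dependencies but stores more per-step activations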

What sets LSTM apart

The LSTM picture and formulas look intimidating, but once you have coded a plain vanilla RNN, the LSTM implementation is pretty much the same. For example, here is the backward pass:

    # Loop over all cells, like before
    d_h_next_t = np.zeros((N, H))
    d_c_next_t = np.zeros((N, H))
    for t in reversed(xrange(T)):
        d_x_t, d_h_prev_t, d_c_prev_t, d_Wx_t, d_Wh_t, d_b_t = lstm_step_backward(d_h_next_t + d_h[:,t,:], d_c_next_t, cache[t])
        d_c_next_t = d_c_prev_t
        d_h_next_t = d_h_prev_t

        d_x[:,t,:] = d_x_t
        d_h0 = d_h_prev_t
        d_Wx += d_Wx_t
        d_Wh += d_Wh_t
        d_b += d_b_t

    # The step in each cell.
    # Captures all the LSTM complexity in a few formulas.
    def lstm_step_backward(d_next_h, d_next_c, cache):
        """
        Backward pass for a single timestep of an LSTM.

        Inputs:
        - d_next_h: Gradient of next hidden state, of shape (N, H)
        - d_next_c: Gradient of next cell state, of shape (N, H)
        - cache: Values from the forward pass

        Returns a tuple of:
        - d_x: Gradient of input data, of shape (N, D)
        - d_prev_h: Gradient of previous hidden state, of shape (N, H)
        - d_prev_c: Gradient of previous cell state, of shape (N, H)
        - d_Wx: Gradient of input-to-hidden weights, of shape (D, 4H)
        - d_Wh: Gradient of hidden-to-hidden weights, of shape (H, 4H)
        - d_b: Gradient of biases, of shape (4H,)
        """
        x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h = cache

        d_z = o * d_next_h                      # next_h = o * z
        d_o = z * d_next_h
        d_next_c += (1 - z * z) * d_z           # z = tanh(next_c)

        d_f = d_next_c * prev_c                 # next_c = f * prev_c + i * g
        d_prev_c = d_next_c * f
        d_i = d_next_c * g
        d_g = d_next_c * i

        d_a_g = (1 - g * g) * d_g               # backprop through tanh
        d_a_o = o * (1 - o) * d_o               # backprop through sigmoid
        d_a_f = f * (1 - f) * d_f
        d_a_i = i * (1 - i) * d_i
        d_a = np.concatenate((d_a_i, d_a_f, d_a_o, d_a_g), axis=1)

        d_prev_h = d_a.dot(Wh.T)
        d_Wh = prev_h.T.dot(d_a)

        d_x = d_a.dot(Wx.T)
        d_Wx = x.T.dot(d_a)

        d_b = np.sum(d_a, axis=0)

        return d_x, d_prev_h, d_prev_c, d_Wx, d_Wh, d_b
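For reference, the backward step above unpacks a cache produced by the matching forward step, which this answer does not show. Here is a minimal sketch of what that forward step presumably looks like, assuming the common convention that a is the concatenated pre-activation split into the i, f, o, g blocks (treat the exact layout and names as an assumption consistent with the backward code, not as the author's original code):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step_forward(x, prev_h, prev_c, Wx, Wh, b):
        # x: (N, D), prev_h/prev_c: (N, H), Wx: (D, 4H), Wh: (H, 4H), b: (4H,)
        N, H = prev_h.shape
        a = x.dot(Wx) + prev_h.dot(Wh) + b     # (N, 4H) pre-activations
        i = sigmoid(a[:, 0*H:1*H])             # input gate
        f = sigmoid(a[:, 1*H:2*H])             # forget gate
        o = sigmoid(a[:, 2*H:3*H])             # output gate
        g = np.tanh(a[:, 3*H:4*H])             # candidate cell update
        next_c = f * prev_c + i * g            # new cell state
        z = np.tanh(next_c)
        next_h = o * z                         # new hidden state
        cache = (x, prev_h, prev_c, Wx, Wh, a, i, f, o, g, next_c, z, next_h)
        return next_h, next_c, cache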

Summary

Now back to your questions.

My question is how LSTM backpropagation differs from regular neural networks

It is the weights shared across layers (time steps), plus a few additional variables (the states) that you need to pay attention to. Apart from that, there is no difference at all.

Do you use the same error (calculated as hypothesis minus output) for each gate? Or do you adjust the error for each gate through some calculation?

First, the loss function is not necessarily L2. In the example above it is a cross-entropy loss, so the initial error signal is its gradient:

    # remember that ps is the probability distribution from the forward pass
    dy = np.copy(ps[t])
    dy[targets[t]] -= 1

Note that this is the same error signal as in an ordinary feed-forward neural network. If you use L2 loss, the signal does indeed equal the ground truth minus the actual output.
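For the cross-entropy case above, here is a tiny self-contained numeric check with made-up numbers: for a softmax p over unnormalized scores y and loss -log(p[target]), the gradient with respect to y is p with 1 subtracted at the target index, which is exactly what dy = np.copy(ps[t]); dy[targets[t]] -= 1 computes.

    import numpy as np

    y = np.array([[2.0], [1.0], [0.1]])               # unnormalized scores for 3 classes
    p = np.exp(y) / np.sum(np.exp(y))                 # softmax, like ps[t]
    target = 0                                        # index of the correct class
    loss = -np.log(p[target, 0])                      # cross-entropy loss

    dy = np.copy(p)                                   # analytic gradient dL/dy
    dy[target] -= 1

    eps = 1e-5                                        # finite-difference check on the first score
    y2 = y.copy(); y2[0] += eps
    p2 = np.exp(y2) / np.sum(np.exp(y2))
    numeric = (-np.log(p2[target, 0]) - loss) / eps
    print("analytic: %f  numeric: %f" % (dy[0, 0], numeric))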

In the LSTM case it is slightly more complicated: d_next_h = d_h_next_t + d_h[:,t,:], where d_h is the upstream gradient from the loss function, so the error signal of each cell is accumulated. But once again, if you unroll the LSTM, you will see a direct correspondence with the network wiring.

+4

I don't think a short answer can cover your questions. Nico's Simple LSTM has a link to an excellent paper by Lipton et al.; please read that. His simple Python code example also helps answer most of your questions. If you understand Nico's last sentence, ds = self.state.o * top_diff_h + top_diff_s, in detail, please give me feedback. At the moment I still have one remaining problem with his "putting all these s and h derivatives together."

0
