What is a recurrent neural network, what is long short-term memory (LSTM), and is LSTM always better?

First, let me apologize for cramming three questions into that title. I'm not sure there is a better way.

I'll get right to it. I think I understand feedforward neural networks fairly well.

But LSTM really eludes me, and I suspect it's because I don't have a very good grasp of recurrent neural networks in general. I've gone through Hinton's and Andrew Ng's courses on Coursera. A lot of it still doesn't make sense to me.

From what I understand, recurrent neural networks differ from feedforward neural networks in that past values influence the next prediction. Recurrent neural networks are generally used for sequences.

The example I saw of a recurrent neural network was binary addition.

010 + 011 

A recurrent neural network would take the rightmost 0 and 1 first and output a 1. Then take the 1 and 1 next, output a 0, and carry the 1. Then take the next 0 and 0 and output a 1, because it carried the 1 from the last calculation. Where does it store this 1? In feedforward networks the result is basically:

  y = a(w*x + b) where w = weights of connections to previous layer and x = activation values of previous layer or inputs 
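Just to make sure I have the notation straight, here is a tiny numpy sketch of that feedforward step; the sigmoid activation and the specific weight values are placeholders I made up, not anything from a real network:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  # One feedforward layer, y = a(w*x + b)
  x = np.array([0.0, 1.0])        # inputs (e.g. the two bits being added)
  W = np.array([[0.5, -0.3],
                [0.8,  0.2]])     # weights of connections to the previous layer
  b = np.array([0.1, -0.1])       # biases
  y = sigmoid(W @ x + b)          # activation values of this layer
  print(y)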

How is a recurrent neural network calculated? I am probably wrong, but from what I understand, a recurrent neural network is pretty much a feedforward neural network with T hidden layers, where T is the number of timesteps. Each hidden layer takes the input X at timestep T, and its outputs are then added to the inputs of the next corresponding hidden layer.

  a(l) = a(w*x + b + pa)
    where l = current timestep
    and x = value at current timestep
    and w = weights of connections to input layer
    and pa = past activation values of the hidden layer,
        such that neuron i in layer l uses the output value of neuron i in layer l-1

  y = o(w*a(l-1) + b)
    where w = weights of connections to the last hidden layer
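Again just to check my understanding, here is a minimal, untrained numpy sketch of that recurrence applied to the binary-addition example; the layer sizes and random weights are arbitrary:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  # Untrained toy RNN: at each timestep the hidden layer sees the current
  # input plus its own activations from the previous timestep.
  rng = np.random.default_rng(0)
  Wx = rng.standard_normal((3, 2)) * 0.1   # input -> hidden weights
  Wh = rng.standard_normal((3, 3)) * 0.1   # hidden -> hidden (recurrent) weights
  bh = np.zeros(3)
  Wy = rng.standard_normal((1, 3)) * 0.1   # hidden -> output weights
  by = np.zeros(1)

  xs = [np.array([0.0, 1.0]),   # timestep 1: rightmost bits of 010 and 011
        np.array([1.0, 1.0]),   # timestep 2
        np.array([0.0, 0.0])]   # timestep 3

  h = np.zeros(3)               # the "past activations" start at zero
  for x in xs:
      h = sigmoid(Wx @ x + bh + Wh @ h)   # a(l) = a(w*x + b + pa)
      y = sigmoid(Wy @ h + by)            # output from the hidden activations
      print(y)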

But even if I understood this correctly, I don't see the benefit of doing this over simply using past values as inputs to a normal feedforward network (a sliding window, or whatever it's called).

For example, what is the advantage of using a recurrent neural network for binary addition instead of training a feedforward network with two output neurons, one for the binary result and one for the carry, and then taking the carry output and plugging it back into the feedforward network?

However, I'm not sure how this would be different from simply having past values as inputs in a feedforward model.

It seems to me that the more timesteps there are, the more recurrent neural networks are at a disadvantage compared to feedforward networks because of the vanishing gradient. Which brings me to my second question: from what I understand, LSTM is a solution to the vanishing gradient problem. But I don't understand how LSTMs work. Also, are they simply better than recurrent neural networks, or are there sacrifices in using an LSTM?

+6

3 answers

What is a recurrent neural network?

The basic idea is that recurrent networks have loops. These loops allow the network to use information from previous passes, which acts as memory. The length of this memory depends on a number of factors, but it is important to note that it is not indefinite. You can think of the memory as degrading, with older information becoming less and less usable.

For example, let's say we just want the network to do one thing: remember whether an input from earlier was 1 or 0. It's not hard to imagine a network that just continually passes the 1 around inside a loop. However, every time you send in a 0, the output going into the loop gets a little lower (this is a simplification, but it captures the idea). After some number of passes, the loop input will be arbitrarily low, making the output of the network 0. As you are aware, the vanishing gradient problem is essentially the same thing, but in reverse.
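As a toy illustration of that decaying loop (the 0.8 recurrent weight below is an arbitrary number, not anything from a real trained network):

  state = 1.0                    # a "1" arrives and is stored in the loop
  for step in range(10):         # then a stream of 0 inputs
      state = 0.8 * state + 0.0  # each pass the looped value gets a little lower
      print(step, round(state, 3))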

Why not just use a window of past inputs?

You offer an alternative: providing a sliding window of past inputs as the current inputs. That's not a bad idea, but consider this: while the RNN's memory may erode over time, you will always lose the entirety of your time information once your window ends. And while you would remove the vanishing gradient problem, you would have to multiply the number of weights in your network several times over. Having to train all of those additional weights will hurt you just as badly as (if not worse than) a vanishing gradient.
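As a rough back-of-the-envelope illustration of that weight blow-up (the layer sizes here are made up):

  # An RNN reuses the same weights at every timestep, while a window model's
  # input layer grows with the window. Sizes below are arbitrary.
  n_in, n_hidden, window = 10, 50, 20

  rnn_weights = n_in * n_hidden + n_hidden * n_hidden + n_hidden   # 3050
  window_weights = (n_in * window) * n_hidden + n_hidden           # 10050

  print("RNN:   ", rnn_weights)
  print("Window:", window_weights)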

What is an LSTM network?

You can think of an LSTM as a special type of RNN. The difference is that an LSTM is able to actively maintain self-connecting loops without them degrading. This is accomplished through a somewhat fancy activation, involving an additional "memory" output for the self-looping connection. The network must then be trained to select what data gets put onto this bus. By training the network to make explicit choices about what to remember, we don't have to worry about new inputs destroying important information, and the vanishing gradient doesn't affect the information we decided to keep.
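Here is a minimal single-timestep sketch of a standard LSTM cell, just to make the "memory bus" and the gating concrete; the sizes and random weights are placeholders:

  import numpy as np

  def sigmoid(z):
      return 1.0 / (1.0 + np.exp(-z))

  n_in, n_hid = 2, 4
  rng = np.random.default_rng(0)
  W = {g: rng.standard_normal((n_hid, n_in + n_hid)) * 0.1 for g in "ifoc"}
  b = {g: np.zeros(n_hid) for g in "ifoc"}

  def lstm_step(x, h, c):
      # c is the "memory" carried along the self-looping connection; the gates
      # decide what gets written to it, what is kept, and what is exposed as h.
      z = np.concatenate([x, h])               # current input + previous hidden
      i = sigmoid(W["i"] @ z + b["i"])         # input gate: what to write
      f = sigmoid(W["f"] @ z + b["f"])         # forget gate: what to keep
      o = sigmoid(W["o"] @ z + b["o"])         # output gate: what to expose
      c_new = f * c + i * np.tanh(W["c"] @ z + b["c"])
      h_new = o * np.tanh(c_new)
      return h_new, c_new

  h, c = np.zeros(n_hid), np.zeros(n_hid)
  h, c = lstm_step(np.array([0.0, 1.0]), h, c)
  print(h, c)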

There are two main drawbacks:

  • It is more expensive to compute the network output and apply backpropagation. You simply have more math to do because of the more complex activation. However, this is not as important as the second point.
  • The explicit memory adds several more weights to each node, all of which must be trained. This increases the dimensionality of the problem and potentially makes it harder to find an optimal solution (a rough count is sketched after this list).
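A rough parameter count for that second point (the layer sizes are made up; a standard LSTM layer carries four weight sets, for the three gates plus the candidate cell value, where a plain RNN layer has one):

  n_in, n_hid = 10, 50
  rnn_params  = n_hid * (n_in + n_hid) + n_hid        # 3050
  lstm_params = 4 * (n_hid * (n_in + n_hid) + n_hid)  # 12200
  print(rnn_params, lstm_params)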

Is it always better?

Which structure is better depends on a number of factors, such as the number of nodes you need for your problem, the amount of available data, and how far back you want your network's memory to reach. However, if you only want the theoretical answer, I would say that given infinite data and computing speed, an LSTM is the better choice; you should not take this as practical advice, though.

+8

A feedforward neural network has connections from layer n to layer n + 1.

A recurrent neural network also allows connections from layer n back to layer n.

These loops allow the network to perform computations on data from previous cycles, which creates a network memory. The length of this memory depends on a number of factors and is an area of active research, but it could be anywhere from tens to hundreds of time steps.

To make this a bit clearer, the carried 1 in your example is stored in the same way as the inputs: in the activation pattern of a neural layer. It is just the recurrent (same-layer) connections that allow the 1 to persist through time.

Obviously, it would be infeasible to replicate every input stream for more than a few past time steps, and choosing which historical streams are important would be very difficult (and would lead to reduced flexibility).

LSTM is a quite different model, which I am only familiar with by comparison to the PBWM model, but in that review LSTM was able to actively maintain neural representations indefinitely, so I believe it is intended more for explicit storage. RNNs are better suited to non-linear time series learning than to storage. I don't know whether there are drawbacks to using an LSTM rather than an RNN.

+5

Both RNNs and LSTMs can be sequence learners. RNNs suffer from the vanishing gradient problem. This problem causes the RNN to have trouble remembering the values of past inputs after more than roughly 10 time steps (an RNN can remember previously seen inputs for only a few time steps).
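A tiny numerical illustration of why that happens: backpropagating through T timesteps multiplies the error signal by a per-step factor T times. The 0.5 factor below is made up; in a real network it comes from the recurrent weights and activation derivatives.

  factor = 0.5    # made-up per-step shrinkage of the backpropagated gradient
  grad = 1.0
  for t in range(1, 31):
      grad *= factor
      if t in (5, 10, 20, 30):
          print(t, grad)   # the signal from ~10+ steps back is already tiny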

LSTM is designed to solve the vanishing gradient problem in RNNs. LSTMs have the ability to bridge long time lags between inputs. In other words, an LSTM is able to remember inputs from up to about 1000 time steps in the past (some papers even claimed it can go beyond that). This ability makes LSTM advantageous for learning long sequences with long time lags. Refer to Alex Graves' Ph.D. thesis, Supervised Sequence Labelling with Recurrent Neural Networks, for some details. If you are new to LSTM, I recommend Colah's blog for a super simple and easy explanation.

However, recent advances in RNNs also claim that, with careful initialization, an RNN can learn long sequences with performance comparable to LSTM: A Simple Way to Initialize Recurrent Networks of Rectified Linear Units.

+3

Source: https://habr.com/ru/post/972682/