Doubts about batch size and time steps in RNN

In the Tensorflow RNN tutorial: https://www.tensorflow.org/tutorials/recurrent , two parameters are mentioned: batch size and time steps. I am confused by these concepts. In my opinion, RNNs introduce batches because a training sequence can be very long, so backpropagation cannot go back that far (exploding / vanishing gradients). So we divide the long training sequence into shorter sequences, each of which is a mini-batch whose size is called "batch size". Am I right so far?

Regarding time steps, an RNN consists of only one cell (an LSTM or GRU cell, or some other cell), and this cell is applied sequentially. We can understand the sequential behavior by unrolling it. But unrolling the sequential cell is a concept, not the actual implementation, meaning we do not implement it by literally unrolling. Suppose the training sequence is a text corpus. Then each time we feed one word to the RNN cell and update the weights. So why do we have time steps here? Combining this with my understanding of "batch size" above, I am even more confused. Do we feed the cell one word or several words (batch size)?

+5
3 answers

The batch size is the number of training samples considered together when updating the network weights. So, if you want to update your network weights based on gradients computed from one word at a time, your batch_size = 1. Since the gradients are computed from a single sample, this is computationally very cheap. On the other hand, it also makes the training very noisy.

To understand what happens during the training of such a feed-forward network, I refer you to this very nice visual example of single_batch versus mini_batch versus single_sample training.
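To make the difference concrete, here is a minimal numpy sketch of my own (the data, the learning rate and the `grad` helper are all made up for illustration, not taken from the tutorial) showing how batch_size = 1 and batch_size = 10 differ only in how many samples each weight update averages over:

    import numpy as np

    np.random.seed(0)
    X = np.random.randn(100, 3)          # 100 samples, 3 features
    y = X @ np.array([1.0, -2.0, 0.5])   # targets from a known linear rule
    lr = 0.1

    def grad(w, X_batch, y_batch):
        """Gradient of mean squared error w.r.t. w for one batch."""
        err = X_batch @ w - y_batch
        return X_batch.T @ err / len(y_batch)

    # batch_size = 1: cheap but noisy, one update per sample
    w = np.zeros(3)
    for i in range(len(y)):
        w -= lr * grad(w, X[i:i+1], y[i:i+1])

    # batch_size = 10: each update averages the gradient over 10 samples
    w = np.zeros(3)
    for i in range(0, len(y), 10):
        w -= lr * grad(w, X[i:i+10], y[i:i+10])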

However, you also want to understand what your num_steps variable does. This is not the same as your batch_size. As you may have noticed, so far I have been referring to feed-forward networks. In a feed-forward network, the output is determined from the network inputs alone, and the input-output relation is mapped by the learned network relations:

hidden_activations(t) = f(input(t))

output(t) = g(hidden_activations(t)) = g(f(input(t)))

After a training pass of batch_size samples, the gradient of your loss function with respect to each of the network parameters is computed and your weights are updated.
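As a rough illustration of these two equations (my own sketch; `f`, `g` and the weight names are arbitrary choices, not TensorFlow code), a feed-forward pass only ever looks at the current input:

    import numpy as np

    W_in, b_in = np.random.randn(3, 4), np.zeros(4)
    W_out, b_out = np.random.randn(4, 2), np.zeros(2)

    def f(x):                      # hidden layer: affine map + tanh
        return np.tanh(x @ W_in + b_in)

    def g(h):                      # output layer: plain affine map
        return h @ W_out + b_out

    x_t = np.random.randn(3)           # input(t)
    hidden_activations = f(x_t)        # depends only on input(t)
    output = g(hidden_activations)     # g(f(input(t)))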

However, in a recurrent neural network (RNN), your network works a little differently:

hidden_activations(t) = f(input(t), hidden_activations(t-1))

output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))

= g(f(input(t), f(input(t-1), hidden_activations(t-2)))) = g(f(inp(t), f(inp(t-1), ..., f(inp(t=0), hidden_initial_state))))
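The same kind of sketch for the recurrent case (again my own toy code, not from the tutorial) makes the difference visible: the hidden state is threaded through the loop, so every output depends on the whole history of inputs:

    import numpy as np

    W_x, W_h, b = np.random.randn(3, 4), np.random.randn(4, 4), np.zeros(4)
    W_out, b_out = np.random.randn(4, 2), np.zeros(2)

    def f(x_t, h_prev):             # hidden state depends on input AND previous state
        return np.tanh(x_t @ W_x + h_prev @ W_h + b)

    def g(h_t):                     # output read from the current hidden state
        return h_t @ W_out + b_out

    inputs = [np.random.randn(3) for _ in range(5)]   # a sequence of 5 time steps
    h = np.zeros(4)                                   # hidden_initial_state
    for x_t in inputs:
        h = f(x_t, h)               # h(t) = f(input(t), h(t-1))
        out = g(h)                  # output(t) = g(h(t))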

As you might have guessed from the name, the network keeps a memory of its previous state, and the neuron activations now also depend on the previous network state and, by extension, on every state the network has ever been in. Most RNNs use a forgetting factor to attach more importance to more recent network states, but that is beside the point of your question.

Now, as you can imagine, it is computationally very expensive to calculate the gradients of the loss function with respect to the network parameters if you have to backpropagate through every state since your network was created, so there is a neat little trick to speed up the computation: approximate the gradients with only a subset of the historical network states, the last num_steps of them.
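A rough sketch of that trick, assuming a toy scalar cell and names I made up (this is not the tutorial's code): the long sequence is cut into windows of num_steps, gradients would only ever flow inside one window, and the last hidden state of a window is handed to the next one as a plain value:

    import numpy as np

    num_steps = 20                              # how far back gradients would reach
    seq = np.sin(np.arange(1000) * 0.1)         # stand-in for one long training sequence
    W_x, W_h = 0.5, 0.9                         # toy scalar RNN "weights"

    def forward_window(window, h0):
        """Run the cell over one window of at most num_steps inputs."""
        states, h = [], h0
        for x_t in window:
            h = np.tanh(W_x * x_t + W_h * h)    # h(t) = f(input(t), h(t-1))
            states.append(h)
        return states

    h = 0.0                                     # carried across windows as a value only
    for start in range(0, len(seq), num_steps):
        states = forward_window(seq[start:start + num_steps], h)
        # ... here the loss on this window would be computed and backpropagated
        #     through at most num_steps states ...
        h = states[-1]                          # truncation: no gradient flows past here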

If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.

+4

I found this chart that helped me visualize the data structure.

Data structure

From the image, "batch size" is the number of sequence examples you train your RNN with in that batch. "Values per time step" are your inputs (in my case, my RNN takes 6 inputs), and finally, your time steps are the "length", so to speak, of the sequence you are training on.
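If it helps, here is a tiny numpy sketch (mine, not the chart author's) of the tensor shape the picture describes, using the 6 inputs per time step mentioned above:

    import numpy as np

    batch_size = 4        # number of sequence examples trained on together
    time_steps = 10       # length of each example sequence
    num_inputs = 6        # values fed to the RNN at each time step

    batch = np.random.randn(batch_size, time_steps, num_inputs)
    print(batch.shape)            # (4, 10, 6)
    print(batch[0].shape)         # one example: (time_steps, num_inputs)
    print(batch[:, 0, :].shape)   # all examples at time step 0: (batch_size, num_inputs)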

I am also learning about recurrent neural networks and how to prepare batches for one of my projects (and stumbled upon this thread trying to figure it out).

Batching for feed-forward and recurrent networks is slightly different, and when browsing different forums the terminology for both gets thrown around, which becomes very confusing, so visualizing it is extremely helpful.

Hope this helps.

+1
  • RNN batch size is about speeding up computation (by keeping multiple lanes of the parallel compute units busy); it is not a mini-batch for backpropagation. An easy way to see this is to play with different batch size values: an RNN cell with batch size = 4 can be roughly 4 times faster than with batch size = 1, and their losses are usually very close.

  • Regarding RNN time steps, consider the following code snippets from rnn.py. static_rnn() calls the cell once for each input_ in turn, and BasicRNNCell::call() implements the forward-pass logic. In a text-prediction case with, say, batch size = 8, we can think of the input here as 8 words from different sentences in a large text corpus, not 8 consecutive words of one sentence. In my experience, we choose the value of the time steps based on how deeply we want to model "time" or "sequential dependence". Again, to predict the next word in a text corpus with BasicRNNCell, a small number of time steps might already work. A large time step size, on the other hand, can run into vanishing-gradient problems.

    def static_rnn(cell, inputs, initial_state=None, dtype=None,
                   sequence_length=None, scope=None):
      """Creates a recurrent neural network specified by RNNCell `cell`.

      The simplest form of RNN network generated is:

        state = cell.zero_state(...)
        outputs = []
        for input_ in inputs:
          output, state = cell(input_, state)
          outputs.append(output)
        return (outputs, state)
      """

    class BasicRNNCell(_LayerRNNCell):
      def call(self, inputs, state):
        """Most basic RNN: output = new_state = act(W * input + U * state + B)."""
        gate_inputs = math_ops.matmul(
            array_ops.concat([inputs, state], 1), self._kernel)
        gate_inputs = nn_ops.bias_add(gate_inputs, self._bias)
        output = self._activation(gate_inputs)
        return output, output
  • Erik Hallström's post is worth reading to visualize how these two parameters relate to the data set and the weights. From his diagram and the code snippets above, it is clear that the RNN batch size does not affect the weights (wa, wb and b), but the "time steps" do (a small sketch after this list illustrates this together with the first point). So you can decide the RNN time steps based on your problem and network model, and the RNN batch size based on your computing platform and data set.
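To see the first and third points in code, here is a small numpy imitation of BasicRNNCell's math (my own sketch, not TensorFlow itself): the kernel's shape is fixed by the input size and state size, while different batch sizes just add parallel rows to the input:

    import numpy as np

    num_inputs, state_size = 6, 32
    kernel = np.random.randn(num_inputs + state_size, state_size)  # weights, batch-independent
    bias = np.zeros(state_size)

    def cell(inputs, state):
        """Same math as BasicRNNCell.call: act(W * input + U * state + B)."""
        gate_inputs = np.concatenate([inputs, state], axis=1) @ kernel + bias
        output = np.tanh(gate_inputs)
        return output, output

    for batch_size in (1, 4, 8):
        x = np.random.randn(batch_size, num_inputs)
        h = np.zeros((batch_size, state_size))
        out, h = cell(x, h)
        print(batch_size, out.shape, kernel.shape)   # kernel shape never changes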

+1

Source: https://habr.com/ru/post/1268577/

