The batch size corresponds to the number of training samples that are considered for a single update of the network weights. So, if you update your weights based on gradients computed from one word at a time, your batch_size = 1. Since the gradients are computed from a single sample, this is computationally very cheap. On the other hand, it also makes for very erratic learning.
To understand what happens during the training of such a feedforward network, have a look at this very nice visual example of single_batch versus mini_batch versus single_sample training.
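In case that example is not at hand, here is a minimal sketch of the same idea (my own, not part of the original answer): a toy linear model trained with plain numpy, where loss_grad, the data, and the learning rate are all made up, and the only point is how many samples contribute to each weight update.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features (toy data)
y = X @ np.array([1.0, -2.0, 0.5])     # targets of a hypothetical linear model
w = np.zeros(3)                        # model weights
lr = 0.01                              # learning rate (arbitrary)

def loss_grad(w, xb, yb):
    """Gradient of the mean squared error over a (mini-)batch."""
    err = xb @ w - yb
    return xb.T @ err / len(yb)

# batch_size = 1: one cheap but noisy update per sample
for i in range(len(X)):
    w -= lr * loss_grad(w, X[i:i+1], y[i:i+1])

# batch_size = 10: each update averages the gradient over 10 samples
batch_size = 10
for start in range(0, len(X), batch_size):
    xb, yb = X[start:start+batch_size], y[start:start+batch_size]
    w -= lr * loss_grad(w, xb, yb)
```

With batch_size = 1 each update is cheap but noisy; averaging over a mini-batch trades a little extra computation per update for a much smoother gradient estimate.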
However, what you really want to understand is your num_steps variable, which is not the same as your batch_size. As you may have noticed, so far I have only been talking about feedforward networks. In a feedforward network, the output is determined solely by the current network inputs, and the input-output relationship is given by the network relations:
hidden_activations(t) = f(input(t))
output(t) = g(hidden_activations(t)) = g(f(input(t)))
After a training pass of batch_size samples, the gradient of your loss function with respect to each of the network parameters is computed and your weights are updated.
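As a hedged illustration of the relations above, here is a small numpy sketch; f, g, and the weight matrices are invented for this example, and the only point is that the output at any step depends on nothing but the input at that same step:

```python
import numpy as np

rng = np.random.default_rng(1)
W_in, W_out = rng.normal(size=(3, 4)), rng.normal(size=(4, 2))

def f(x):                        # hidden_activations(t) = f(input(t))
    return np.tanh(x @ W_in)

def g(h):                        # output(t) = g(hidden_activations(t))
    return h @ W_out

batch = rng.normal(size=(8, 3))  # batch_size = 8 independent samples
outputs = g(f(batch))            # no dependence on any earlier time step
```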
However, in a recurrent neural network (RNN), your network works a little differently:
hidden_activations(t) = f(input(t), hidden_activations(t-1))
output(t) = g(hidden_activations(t)) = g(f(input(t), hidden_activations(t-1)))
          = g(f(input(t), f(input(t-1), hidden_activations(t-2)))) = g(f(input(t), f(input(t-1), ..., f(input(t=0), hidden_initial_state))))
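To make the recurrence concrete, here is a hedged numpy sketch of the relations above; the weights and dimensions are invented, and the point is only that the hidden state carries information from every earlier input forward in time:

```python
import numpy as np

rng = np.random.default_rng(2)
W_x, W_h, W_out = rng.normal(size=(3, 4)), rng.normal(size=(4, 4)), rng.normal(size=(4, 2))

def f(x_t, h_prev):                  # hidden(t) = f(input(t), hidden(t-1))
    return np.tanh(x_t @ W_x + h_prev @ W_h)

def g(h_t):                          # output(t) = g(hidden(t))
    return h_t @ W_out

inputs = rng.normal(size=(20, 3))    # a sequence of 20 time steps
h = np.zeros(4)                      # hidden_initial_state
for x_t in inputs:
    h = f(x_t, h)                    # the state carries all past inputs forward
    out_t = g(h)
```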
As you might have guessed from the name, the network retains a memory of its previous state, and the neuron activations now also depend on the previous network state and, by extension, on all the states the network has ever been in. Most RNNs employ a forgetting factor to attach more importance to more recent network states, but that is beside the point of your question.
Now, as you can imagine, it is computationally very expensive to calculate the gradients of the loss function with respect to the network parameters if you have to backpropagate through all the states since the creation of your network, so there is a neat little trick to speed up your computation: approximate your gradients with a subset of the most recent num_steps historical network states.
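This trick is usually called truncated backpropagation through time. Here is a minimal sketch of the idea (my own, using invented toy weights and data, not the tutorial's actual code): the long sequence is cut into chunks of num_steps, the hidden state is carried between chunks as a plain value, and gradients are only ever computed across the states inside one chunk rather than all the way back to t = 0.

```python
import numpy as np

rng = np.random.default_rng(3)
W_x, W_h = rng.normal(size=(3, 4)), rng.normal(size=(4, 4))
inputs = rng.normal(size=(1000, 3))   # one long input sequence
num_steps = 20                        # how far back gradients are allowed to flow

h = np.zeros(4)                       # carried hidden state, treated as a constant
                                      # at the start of every chunk
for start in range(0, len(inputs), num_steps):
    chunk = inputs[start:start + num_steps]
    states = []
    for x_t in chunk:                 # forward pass through num_steps states only
        h = np.tanh(x_t @ W_x + h @ W_h)
        states.append(h)
    # A real implementation would compute the loss on this chunk, backpropagate
    # through `states` only, update W_x and W_h, and then reuse the final `h`
    # as the (constant) initial state of the next chunk.
```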
If this conceptual discussion was not clear enough, you can also take a look at a more mathematical description of the above.