When we train the network, the intermediate activations computed during the forward pass must be kept in memory for the backward pass. So you need to estimate how much memory storing all of those activations will take during the forward pass, on top of the other memory costs (keeping your weights on the GPU, etc.); a rough back-of-the-envelope estimate is sketched below. Note that if your network is quite deep, you may want to use a smaller batch size, since you may run out of memory.
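Here is a minimal sketch of that kind of estimate, assuming a plain fully connected network with made-up layer widths and float32 activations; the numbers are purely illustrative, not a real model.

```python
# Back-of-the-envelope estimate of activation memory for an assumed
# fully connected network; float32 activations take 4 bytes each.
layer_widths = [784, 4096, 4096, 1024, 10]  # illustrative layer sizes
bytes_per_value = 4  # float32

def activation_memory_mb(batch_size):
    # Each layer's output is kept for the backward pass, so total
    # activation storage scales linearly with the batch size.
    values = sum(width * batch_size for width in layer_widths)
    return values * bytes_per_value / 1024 ** 2

for batch_size in (32, 64, 128, 256):
    print(f"batch {batch_size:>3}: ~{activation_memory_mb(batch_size):.1f} MB of activations")
```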
Batch size selection is a trade-off between memory constraints and performance / accuracy (usually evaluated with cross-validation).
Personally, I estimate by hand roughly how much GPU memory my forward / backward pass will use and then try a few values, as in the sketch below. If, for example, the largest batch size that fits is about 128, I will also double-check with 32, 64, 96, etc., just to be thorough and see whether I can improve performance. This is usually for deeper networks that are going to push my GPU memory (I only have a 4 GB card; I don't have access to the monster NVIDIA cards).
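A rough sketch of that "try a few values" check, assuming PyTorch: run one forward/backward pass per candidate batch size and catch the CUDA out-of-memory error. The model here is a stand-in, not any particular network.

```python
import torch
import torch.nn as nn

device = "cuda"
# Stand-in model; replace with your own network.
model = nn.Sequential(
    nn.Linear(784, 4096), nn.ReLU(),
    nn.Linear(4096, 4096), nn.ReLU(),
    nn.Linear(4096, 10),
).to(device)
criterion = nn.CrossEntropyLoss()

for batch_size in (32, 64, 96, 128):
    torch.cuda.reset_peak_memory_stats()
    try:
        x = torch.randn(batch_size, 784, device=device)
        y = torch.randint(0, 10, (batch_size,), device=device)
        loss = criterion(model(x), y)
        loss.backward()  # backward pass consumes the stored activations
        model.zero_grad()
        peak_mb = torch.cuda.max_memory_allocated() / 1024 ** 2
        print(f"batch {batch_size} fits (~{peak_mb:.0f} MB peak)")
    except RuntimeError as e:
        if "out of memory" in str(e):
            print(f"batch {batch_size} does not fit")
            torch.cuda.empty_cache()
        else:
            raise
```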
I think that, as a rule, more attention should go to the network architecture, optimization methods / training tricks, and data preprocessing.