Multi-GPU architecture, gradient averaging - a less accurate model?

When I run the CIFAR-10 model described at https://www.tensorflow.org/tutorials/deep_cnn , I reach 86% accuracy after about 4 hours on a single GPU. When I use 2 GPUs, the accuracy drops to 84%, although reaching 84% is faster with 2 GPUs than with 1.

My intuition is that the average_gradients function defined in https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py returns a less accurate gradient, since an average of gradients will be less accurate than the actual gradient value.

If the gradients are less accurate, then the parameters that control the function being learned during training are less accurate. Looking at the code ( https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10/cifar10_multi_gpu_train.py ), why is averaging gradients over several GPUs less accurate than computing the gradient on a single GPU?

Is my intuition correct that averaging the gradients produces a less accurate value?
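
For context, the core of average_gradients in the linked script looks roughly like this (abridged from the tutorial code; see the repository for the exact version):

import tensorflow as tf

def average_gradients(tower_grads):
    """Average the gradients computed by each tower (GPU).

    tower_grads is a list with one entry per GPU; each entry is the list of
    (gradient, variable) pairs returned by opt.compute_gradients() on that GPU.
    """
    average_grads = []
    for grad_and_vars in zip(*tower_grads):
        # Stack the per-GPU gradients for this variable and take their mean.
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_mean(tf.concat(axis=0, values=grads), 0)
        # The variables are shared across towers, so keep the first tower's.
        average_grads.append((grad, grad_and_vars[0][1]))
    return average_grads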

Randomness in the model is described as:

The images are processed as follows:
They are cropped to 24 x 24 pixels, centrally for evaluation or randomly for training.
They are approximately whitened to make the model insensitive to dynamic range.
For training, we additionally apply a series of random distortions to artificially increase the data set size:

Randomly flip the image from left to right.
Randomly distort the image brightness.
Randomly distort the image contrast.

src: https://www.tensorflow.org/tutorials/deep_cnn
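
Concretely, the distortions listed above map onto standard tf.image ops. A minimal sketch of that training-time preprocessing (my own reconstruction; the exact distortion parameters in the tutorial's cifar10_input.py may differ):

import tensorflow as tf

def distort_for_training(image):
    # image: one CIFAR-10 example as a [32, 32, 3] tensor.
    image = tf.cast(image, tf.float32)
    # Random 24x24 crop; evaluation uses a central crop instead.
    image = tf.random_crop(image, [24, 24, 3])
    # Random left-right flip.
    image = tf.image.random_flip_left_right(image)
    # Random brightness and contrast distortions.
    image = tf.image.random_brightness(image, max_delta=63)
    image = tf.image.random_contrast(image, lower=0.2, upper=1.8)
    # Approximate whitening: zero mean, unit variance per image.
    image = tf.image.per_image_standardization(image)
    return image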

Does this affect the accuracy of the training?

Update:

To explore this further, I logged the value of the loss function for different numbers of GPUs.

Training with 1 GPU: loss value 0.7, accuracy 86%
Training with 2 GPUs: loss value 0.5, accuracy 84%

Shouldn't the loss value be lower for higher accuracy, and not vice versa?

+4 (3 answers)

Regarding your point (1): averaging the gradients from the GPUs is not inherently less accurate. Each GPU computes the gradient of the average loss over its own batch of examples, and these two lines of average_gradients then average the per-GPU results:

grad = tf.concat(axis=0, values=grads)
grad = tf.reduce_mean(grad, 0)

Because the gradient is a linear function of the loss, the mean of the per-GPU gradients is exactly the gradient of the average loss over the combined (twice as large) batch.

So (1) does not by itself explain why the 1-GPU and 2-GPU runs reach different accuracy; if anything, the averaged gradient is less noisy, because it is computed from more examples per step. (The two runs also differ simply because of randomness: initialization, shuffling, and the random distortions.)
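
A quick toy check of that equivalence (my own illustration, not from the tutorial): for a mean-squared-error loss, the gradient over a full batch equals the average of the gradients over two equal-sized half-batches, because the gradient is linear in the per-example losses.

import numpy as np

rng = np.random.RandomState(0)
X = rng.randn(8, 3)   # 8 examples, 3 features
y = rng.randn(8)
w = rng.randn(3)      # current parameters

def grad_mean_mse(Xb, yb, w):
    # Gradient of mean((Xb @ w - yb) ** 2) with respect to w.
    return (2.0 / len(yb)) * (Xb.T @ (Xb @ w - yb))

g_full = grad_mean_mse(X, y, w)                      # one big batch
g_avg = 0.5 * (grad_mean_mse(X[:4], y[:4], w)        # "GPU 0"
               + grad_mean_mse(X[4:], y[4:], w))     # "GPU 1"
print(np.allclose(g_full, g_avg))                    # True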

What does change between the two setups is the effective batch size and, with it, the learning-rate schedule. With n GPUs each training step consumes n times as many examples as a 1-GPU step, but the learning rate in the tutorial decays as a function of the global step, not of the number of examples seen. You can check which learning rate is actually being used in each run by adding, for example,

print sess.run(lr)

to the training loop and comparing the two runs.

In short, (1) is not the culprit; when you change the number of GPUs (and therefore the effective batch size), you generally have to re-tune the learning rate and its decay schedule to reach comparable accuracy.
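
For reference, the learning-rate schedule in the tutorial is built roughly like this (paraphrased from cifar10.py / cifar10_multi_gpu_train.py, constants from memory): decay_steps is derived from the per-tower batch size only, so it does not account for how many GPUs consume data on each step.

import tensorflow as tf

# Constants roughly as in the tutorial's cifar10.py (check the repo).
NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN = 50000
NUM_EPOCHS_PER_DECAY = 350.0
INITIAL_LEARNING_RATE = 0.1
LEARNING_RATE_DECAY_FACTOR = 0.1
batch_size = 128  # per-tower batch size (FLAGS.batch_size)

global_step = tf.train.get_or_create_global_step()

# decay_steps assumes one batch of `batch_size` examples per step, even
# though with N GPUs each step actually consumes N * batch_size examples.
num_batches_per_epoch = NUM_EXAMPLES_PER_EPOCH_FOR_TRAIN / batch_size
decay_steps = int(num_batches_per_epoch * NUM_EPOCHS_PER_DECAY)

lr = tf.train.exponential_decay(INITIAL_LEARNING_RATE, global_step,
                                decay_steps, LEARNING_RATE_DECAY_FACTOR,
                                staircase=True)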

+2

(This is more of a remark than a complete answer.) Synchronous multi-GPU training with gradient averaging is effectively SGD with a larger batch, and SGD with a larger batch follows a different optimization trajectory than SGD with a smaller batch. You should not expect the two setups to land on exactly the same accuracy, even though both are doing "correct" SGD.

[Zhang et al., 2015] compare plain SGD with parallel variants of SGD (elastic averaging) in which the workers are combined differently than by averaging gradients at every step, and find that how the workers are combined changes the solutions the training converges to. Parallelizing SGD is not a neutral implementation detail; it changes the algorithm.

Regarding your update about the loss values: first, the loss reported by the deep_cnn tutorial is the training loss, i.e. the softmax cross-entropy on the current (randomly distorted) training batches plus the L2 weight-decay terms, while the 86% / 84% figures are accuracy on held-out data. A lower training loss therefore does not have to mean a higher test accuracy; the accuracy depends only on which class receives the largest logit, not on how small the loss is.
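
For reference, the reported total loss in the tutorial is constructed roughly like this (paraphrased from cifar10.py; details from memory):

import tensorflow as tf

def loss(logits, labels):
    # Softmax cross-entropy averaged over the batch.
    cross_entropy = tf.nn.sparse_softmax_cross_entropy_with_logits(
        labels=labels, logits=logits, name='cross_entropy_per_example')
    cross_entropy_mean = tf.reduce_mean(cross_entropy, name='cross_entropy')
    tf.add_to_collection('losses', cross_entropy_mean)
    # The 'losses' collection also holds the L2 weight-decay terms added when
    # the weight variables were created, so the logged value is cross-entropy
    # plus weight decay, not cross-entropy alone.
    return tf.add_n(tf.get_collection('losses'), name='total_loss')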

Beyond that, a single run of each configuration tells you very little. Initialization, shuffling, and the random cropping / flipping / brightness / contrast distortions all inject noise, so two runs of the same 1-GPU setup will not reach identical accuracy either. A two-point gap between one 1-GPU run and one 2-GPU run may be within normal run-to-run variation; without several runs per configuration you cannot really say what "the accuracy of 1 GPU" is.

Second, the loss is evaluated on the distorted training batches, while the accuracy is measured on the centrally cropped, undistorted evaluation data, so the two numbers are not even computed on the same inputs. A model can fit the training batches slightly better and still generalize slightly worse.

So before concluding that gradient averaging itself makes the model less accurate, repeat each configuration several times, compare the learning-rate schedules, and compare test accuracy rather than training loss.

+2

Have a look at "Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour" from Facebook; it deals with exactly this situation. The recommendation there is the linear scaling rule: when the minibatch size is multiplied by k (for example by training on k GPUs), multiply the learning rate by k as well.

In fact, I found that simply adding the gradients from the GPUs (rather than averaging them) and keeping the original learning rate sometimes also does the job.
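
That makes sense: summing k per-GPU gradients while keeping the learning rate is the same update as averaging them and multiplying the learning rate by k, i.e. the linear scaling rule in disguise. A minimal sketch of that one-line change (my own illustration, based on the abridged average_gradients shown earlier):

import tensorflow as tf

def sum_gradients(tower_grads):
    # Same structure as average_gradients, but tf.reduce_sum instead of
    # tf.reduce_mean: summing k gradients at learning rate lr is equivalent
    # to averaging them at learning rate k * lr.
    summed_grads = []
    for grad_and_vars in zip(*tower_grads):
        grads = [tf.expand_dims(g, 0) for g, _ in grad_and_vars]
        grad = tf.reduce_sum(tf.concat(axis=0, values=grads), 0)
        summed_grads.append((grad, grad_and_vars[0][1]))
    return summed_grads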

0
