Distributed TensorFlow: data distribution and parameter server placement

I have been closely following the ImageNet training example with distributed TensorFlow (the Inception model).

I can't understand how the data gets distributed when this example runs on two different workers. In theory, different workers should see different portions of the data. Also, which part of the code tells the parameters to go to the parameter server? In the multi-GPU example there is an explicit section for "cpu:0".

1 answer

Different workers see different pieces of data because each one dequeues its mini-batch of images from a single queue of preprocessed images. To elaborate: in the distributed setup for training the ImageNet (Inception) model, input images are preprocessed by multiple threads, and the preprocessed images are stored in a single RandomShuffleQueue. Look for tf.RandomShuffleQueue in the input-processing code to see how this is done. The multiple workers are organized as "Inception towers", and each tower dequeues a mini-batch of images from that one queue, thereby receiving a different portion of the input.

The second part of your question is handled by the variable-placement logic. Locate slim.variables.VariableDeviceChooser in the model code; it ensures that Variable objects are assigned evenly across the tasks acting as parameter servers. All the other workers, which carry out the actual training, fetch the variables at the beginning of each step and push their updates back at the end of the step.
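Here is a minimal sketch of the input-side mechanism, assuming the TF 1.x graph API; the shapes, capacities, and stand-in example data are made up for illustration and are not taken from the Inception code:

```python
import tensorflow as tf  # sketch assumes the TF 1.x graph API

# One shared queue of preprocessed examples. Every tower dequeues from this
# same queue, so different towers (and hence workers) see different data.
queue = tf.RandomShuffleQueue(
    capacity=1000,
    min_after_dequeue=100,
    dtypes=[tf.float32, tf.int32],
    shapes=[[224, 224, 3], []])

# Stand-in for what a preprocessing thread would enqueue (hypothetical data;
# the real pipeline enqueues decoded and distorted ImageNet images).
image = tf.random_uniform([224, 224, 3])
label = tf.constant(1, dtype=tf.int32)
enqueue_op = queue.enqueue([image, label])

# Each Inception tower issues its own dequeue_many(); the queue hands each
# caller a disjoint mini-batch, which is how the data gets partitioned.
batch_size = 8
tower_0_images, tower_0_labels = queue.dequeue_many(batch_size)
tower_1_images, tower_1_labels = queue.dequeue_many(batch_size)
```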
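For the variable-placement side, the same round-robin idea can be sketched with the standard tf.train.replica_device_setter helper. The Inception code uses its own slim.variables.VariableDeviceChooser, so treat this as an illustrative stand-in, and note that the cluster addresses below are hypothetical:

```python
import tensorflow as tf  # sketch assumes the TF 1.x distributed API

# Hypothetical cluster with two parameter server tasks and two workers.
cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222', 'ps1.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222'],
})

# replica_device_setter pins Variable ops to the ps tasks (round-robin by
# default) while leaving the compute ops on the local worker device.
with tf.device(tf.train.replica_device_setter(
        worker_device='/job:worker/task:0', cluster=cluster)):
    weights = tf.get_variable('weights', shape=[1024, 1000])  # -> /job:ps/task:0
    biases = tf.get_variable('biases', shape=[1000])          # -> /job:ps/task:1
    logits = tf.matmul(tf.zeros([32, 1024]), weights) + biases  # stays on the worker
```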


Source: https://habr.com/ru/post/1662871/
