Distributed TensorFlow training in Google Cloud ML Engine

I am launching a large distributed TensorFlow model in Google Cloud ML Engine, and I want to use machines with GPUs. My graph consists of two main parts: the input/data-reading function and the compute part.

I want to place the variables on the PS tasks, the input part on the CPU, and the compute part on the GPU. The tf.train.replica_device_setter function automatically places variables on the PS servers.

This is what my code looks like:

    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        input_tensors = model.input_fn(...)
        output_tensors = model.model_fn(input_tensors, ...)

Is it possible to combine tf.device() with replica_device_setter(), as in:

    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        with tf.device('/cpu:0'):
            input_tensors = model.input_fn(...)
        with tf.device('/gpu:0'):
            tensor_dict = model.model_fn(input_tensors, ...)

Will replica_device_setter() be overridden, so that the variables are no longer placed on the PS server?

Also, since the device names in the cluster look like job:master/replica:0/task:0/gpu:0, how do I express tf.device(whatever/gpu:0) to TensorFlow?

1 answer

Any ops other than variables created inside a tf.train.replica_device_setter() block are automatically pinned to "/job:worker", which by default means the first device managed by the first task in the "worker" job.

You can pin them to a different device (or task) with a nested device block:

    with tf.device(tf.train.replica_device_setter(ps_tasks=2,
                                                  ps_device="/job:ps",
                                                  worker_device="/job:worker")):
        v1 = tf.Variable(1., name="v1")  # pinned to /job:ps/task:0 (defaults to /cpu:0)
        v2 = tf.Variable(2., name="v2")  # pinned to /job:ps/task:1 (defaults to /cpu:0)
        v3 = tf.Variable(3., name="v3")  # pinned to /job:ps/task:0 (defaults to /cpu:0)
        s = v1 + v2                      # pinned to /job:worker (defaults to task:0/cpu:0)
        with tf.device("/task:1"):
            p1 = 2 * s                   # pinned to /job:worker/task:1 (defaults to /cpu:0)
            with tf.device("/cpu:0"):
                p2 = 3 * s               # pinned to /job:worker/task:1/cpu:0
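Applied to the setup in the question, a minimal sketch might look like the following (cluster_spec, model.input_fn and model.model_fn are the placeholders from the question; a single worker with one GPU is assumed):

    import tensorflow as tf

    with tf.device(tf.train.replica_device_setter(cluster=cluster_spec)):
        # Variables created at this level are still routed to /job:ps by
        # the device setter; the nested blocks only refine op placement.
        with tf.device('/cpu:0'):
            # input pipeline on the worker's CPU
            input_tensors = model.input_fn(...)
        with tf.device('/gpu:0'):
            # heavy computation on the worker's GPU
            output_tensors = model.model_fn(input_tensors, ...)

So, to the first question: no, replica_device_setter() is not simply overridden. As far as I can tell, its device function merges its choice with the innermost device string, so non-variable ops keep /job:worker while picking up /cpu:0 or /gpu:0. One caveat: a variable created inside the '/gpu:0' block would also pick up the gpu:0 field and end up on something like /job:ps/task:0/gpu:0, so it is safest to create variables outside the inner blocks. For the second question, tf.device() also accepts a fuller spec such as "/job:worker/task:0/gpu:0"; any fields you leave unset are filled in from the surrounding scope.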
