What is a device in TensorFlow?

It would be very useful for me to have a clear definition of what exactly a device is in TensorFlow. Is a device a single processor (with no "real" concurrency)?

You can specify as many devices as you want like this:

    config = tf.ConfigProto(device_count={"CPU": 2},
                            inter_op_parallelism_threads=2,
                            intra_op_parallelism_threads=1)
    sess = tf.Session(config=config)

How is it possible to specify as many devices as you want when there is only one processor with 4 cores?

1 answer

Too long for a comment (maybe @mrry or @keveman might give an official definition), but here are a few notes:

  • A logical device in TensorFlow is a computing unit with its own memory.
  • The TensorFlow scheduler adds Send/Recv ops to copy data to the proper device when data crosses device boundaries.
  • It is a logical device, so you can have more logical devices than physical devices (cores), and some ops scheduled on available "devices" may sit idle, waiting for a physical device to free up (a minimal sketch of this follows the list). For CPU devices, you may have more threads than you have cores, so the OS thread scheduler picks a subset of threads to run at any given time.
  • An op scheduled on the logical device tf.device("gpu:0") can keep its data in main memory (i.e., on the physical CPU device), so in practice the logical device boundary is sometimes violated. This is the HostMemory annotation you see in ops, for example integer Add here. It makes it possible to run housekeeping ops, such as shape manipulation, on the GPU logical device and to avoid crossing the logical device boundary (Send/Recv ops), even though the data is not stored on the physical GPU device.
  • Using device_count={"CPU": m} ... intra_op_parallelism_threads=n creates m separate thread pools with n threads each, so you can manually partition your graph to run m ops in parallel, where each op requests n threads. However, you cannot run more threads simultaneously than you have physical cores, so this can be slow.
  • Logical devices such as cpu:0 are not pinned to specific cores, so they can use any of the available cores.
  • You can see what the actual parallelism was by looking at the timelines.
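As a minimal sketch of the logical-vs-physical distinction (TF 1.x API as in the question; the specific ops here are just illustrative), the snippet below requests two logical CPU devices on what may be a single physical processor and uses log_device_placement to show where each op lands:

    # Sketch: two logical CPU devices backed by the same physical processor.
    import tensorflow as tf

    config = tf.ConfigProto(device_count={"CPU": 2},
                            inter_op_parallelism_threads=2,
                            intra_op_parallelism_threads=1,
                            log_device_placement=True)

    with tf.device("cpu:0"):
        x = tf.ones((2, 2))
    with tf.device("cpu:1"):
        y = tf.matmul(x, x)   # placed on the second logical CPU device

    with tf.Session(config=config) as sess:
        sess.run(y)           # per-op placement is logged to stderr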

Here is an example of creating 8 CPU devices and running 2 matmuls in parallel: https://gist.github.com/yaroslavvb/9a5f4a0b613c79152152b35c0bc840b8

The core of the graph construction looks like this:

    with tf.device("cpu:0"):
        a1 = tf.ones((n, n))
        a2 = tf.ones((n, n))
    with tf.device("cpu:1"):
        a3 = tf.matmul(a1, a2)
    with tf.device("cpu:2"):
        a4 = tf.matmul(a1, a2)
    with tf.device("cpu:3"):
        a5 = tf.matmul(a3, a4)
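Here is a rough sketch (not the gist's exact code; the device_count and thread-pool values are illustrative) of how such a graph can be run with tracing enabled to produce the run_metadata and timeline.json referred to below:

    # Sketch: run the graph with full tracing and dump a Chrome-trace timeline.
    from tensorflow.python.client import timeline

    config = tf.ConfigProto(device_count={"CPU": 8},
                            inter_op_parallelism_threads=8,
                            intra_op_parallelism_threads=1)
    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                                output_partition_graphs=True)
    run_metadata = tf.RunMetadata()

    with tf.Session(config=config) as sess:
        sess.run(a5, options=run_options, run_metadata=run_metadata)

    # run_metadata.partition_graphs holds the per-device graphs (with Send/Recv ops)
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())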

If you run the gist and look at the partition_graphs section of run_metadata, you will see that Send/Recv ops have been added to transfer data between the CPU devices, i.e. something like this:

    partition_graphs {
      node {
        name: "MatMul_1/_11"
        op: "_Recv"
        device: "/job:localhost/replica:0/task:0/cpu:3"
        attr {
          key: "client_terminated"
          value { b: false }
        }
        attr {
          key: "recv_device"
          value { s: "/job:localhost/replica:0/task:0/cpu:3" }
        }
        attr {
          key: "send_device"
          value { s: "/job:localhost/replica:0/task:0/cpu:2" }
        }

So you can see that a Send op is scheduled to transfer data from cpu:2 to cpu:3. Since all CPU devices use shared memory, this op does nothing, but it may do something in the future once TensorFlow becomes NUMA-aware.

Alternatively, you can open timeline.json in a browser under chrome://tracing and look at the timings.

[Timeline screenshot: two 1024x1024 matmuls running in parallel]

You can see that it executes two 1024x1024 matmuls in parallel, each taking about 85 ms, which works out to roughly 25 GFLOP/s per matmul, in line with the single-core performance of a 2-year-old MacBook.
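The throughput estimate works out as follows, taking n = 1024 and the ~85 ms per matmul measured above:

    n, t = 1024, 0.085        # matrix size and measured time per matmul
    flops = 2 * n ** 3        # multiply-adds in one n x n matmul
    print(flops / t / 1e9)    # ~25 GFLOP/s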

On the other hand, you could run 6 such matmuls on 6 different CPU devices, and you would see something like this:

[Timeline screenshot: six matmuls competing for four physical cores]

Since I only have 4 physical cores, you can see that 2 of the operations take about 2x longer. Even though they were active on a logical cpu device, no physical core was available for the first 100 ms, so they made no progress.
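For reference, the graph for that second run can be constructed roughly like this (a sketch, reusing a1 and a2 and assuming a config with at least 6 CPU devices as above):

    # Sketch: six independent matmuls, each pinned to its own logical CPU device.
    products = []
    for i in range(6):
        with tf.device("cpu:%d" % i):
            products.append(tf.matmul(a1, a2))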
