Too long for a comment (maybe @mrry or @keveman might give an official definition), but here are a few notes:
- A logical device in TensorFlow is a computing unit with its own memory.
- The TensorFlow scheduler adds Send/Recv ops to copy data to the proper device whenever data crosses a logical device boundary.
- These are logical devices, so you can have more logical devices than physical devices (cores), and an op scheduled on an available "device" may sit idle, waiting until the physical device frees up. For CPU devices you can have more threads than you have cores, so the OS thread scheduler picks a subset of threads to run at any given time.
- An op scheduled on the logical device tf.device("gpu:0") can keep its data in main memory (i.e., on the physical CPU device), so in practice the logical device boundary is sometimes violated. That is the HostMemory annotation you see in ops, for example integer Add here. This lets bookkeeping ops, such as shape manipulation, run on a GPU logical device and avoid crossing logical device boundaries (Send/Recv ops), even though the data is not stored on the physical GPU device.
- Using device_count={"CPU": m} together with intra_op_parallelism_threads=n creates several custom thread pools with n threads each, so you can manually partition your computation to run m ops in parallel, where each op will request n threads (see the sketch after this list). However, you cannot run more threads concurrently than you have physical cores, so this can be slow.
- Logical devices such as cpu:0 are not pinned to specific cores, so they can use whatever cores are available.
- You can see what the actual parallelism was by looking at the timeline.
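Here is a minimal sketch of that kind of configuration, assuming the TensorFlow 1.x session API; the values m=4 and n=2 are arbitrary, chosen only for illustration:

    import tensorflow as tf

    # m=4 logical CPU devices (cpu:0 ... cpu:3), each with an intra-op
    # thread pool of n=2 threads (per the note above).
    config = tf.ConfigProto(
        device_count={"CPU": 4},
        intra_op_parallelism_threads=2,
    )
    sess = tf.Session(config=config)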
Here is an example of creating 8 CPU devices and running 2 matmuls in parallel: https://gist.github.com/yaroslavvb/9a5f4a0b613c79152152b35c0bc840b8
The main part of the graph is constructed as follows:
    with tf.device("cpu:0"):
        a1 = tf.ones((n, n))
        a2 = tf.ones((n, n))
    with tf.device("cpu:1"):
        a3 = tf.matmul(a1, a2)
    with tf.device("cpu:2"):
        a4 = tf.matmul(a1, a2)
    with tf.device("cpu:3"):
        a5 = tf.matmul(a3, a4)
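To collect the metadata discussed below, the graph can be run with tracing enabled. This is a sketch, assuming sess is a Session created with a multi-CPU ConfigProto like the one above and n has been defined; the gist uses a similar pattern:

    # Run the final op with full tracing and partition-graph output enabled.
    options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE,
                            output_partition_graphs=True)
    run_metadata = tf.RunMetadata()
    sess.run(a5.op, options=options, run_metadata=run_metadata)  # a5.op avoids fetching the result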
If you run the gist and look at the partition graphs in the run_metadata, you will see that Send/Recv ops have been added to transfer data between the CPU devices, i.e. something like this:
    partition_graphs {
      node {
        name: "MatMul_1/_11"
        op: "_Recv"
        device: "/job:localhost/replica:0/task:0/cpu:3"
        attr { key: "client_terminated" value { b: false } }
        attr { key: "recv_device" value { s: "/job:localhost/replica:0/task:0/cpu:3" } }
        attr { key: "send_device" value { s: "/job:localhost/replica:0/task:0/cpu:2" } }
So you can see a Send/Recv pair scheduled to transfer data from cpu:2 to cpu:3. Since all CPU devices use the same shared memory, this op does nothing, but it may do something in the future if TensorFlow becomes NUMA-aware.
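If you prefer to inspect this programmatically rather than reading the proto dump, here is a small sketch, assuming the run_metadata collected with output_partition_graphs=True above:

    # List all cross-device transfer ops in the partitioned graphs.
    for graph_def in run_metadata.partition_graphs:
        for node in graph_def.node:
            if node.op in ("_Send", "_Recv"):
                print(node.op, node.name, node.device)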
Alternatively, you can open timeline.json in Chrome under chrome://tracing and look at the timeline:
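A sketch of how that timeline.json file can be produced from the step stats collected above, using TensorFlow's Chrome-trace helper:

    from tensorflow.python.client import timeline

    # Convert the FULL_TRACE step stats into Chrome's trace-event format.
    trace = timeline.Timeline(run_metadata.step_stats)
    with open("timeline.json", "w") as f:
        f.write(trace.generate_chrome_trace_format())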

You can see that it runs two 1024x1024 matmuls in parallel, each taking about 85 ms, which works out to roughly 25 G ops per second, in line with the single-core performance of a 2-year-old MacBook.
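As a rough sanity check on that number, assuming the usual 2*n^3 floating point ops for an n x n matmul:

    n = 1024
    flops_per_matmul = 2 * n ** 3            # ≈ 2.15e9 ops
    seconds = 0.085                          # per-matmul time seen in the timeline
    print(flops_per_matmul / seconds / 1e9)  # ≈ 25 G ops/second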
On the other hand, you can run 6 such matmuls on 6 different CPU devices, and you will see something like this:

I only have 4 physical cores, and you can see that 2 of the ops take about 2x longer: even though they were placed on a cpu logical device right away, no physical core was available for the first 100 ms, so they made no progress.