Currently, TensorFlow uses only one compute stream and several copy streams per GPU. Some kernels may choose to use multiple streams for computation while maintaining single-stream semantics.
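To illustrate the pattern, here is a minimal CUDA sketch (not TensorFlow's actual code; all names are hypothetical) of one compute stream plus dedicated copy streams, with events enforcing single-stream ordering between them:

```cuda
// Sketch only: one compute stream, separate H2D/D2H copy streams,
// events preserve single-stream semantics across the three streams.
#include <cuda_runtime.h>

__global__ void scale(float* d, int n, float a) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= a;
}

int main() {
    const int n = 1 << 20;
    float *h, *d;
    cudaMallocHost(&h, n * sizeof(float));   // pinned host memory for async copies
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t compute, h2d, d2h;          // one compute stream, two copy streams
    cudaStreamCreate(&compute);
    cudaStreamCreate(&h2d);
    cudaStreamCreate(&d2h);

    cudaEvent_t uploaded, computed;
    cudaEventCreate(&uploaded);
    cudaEventCreate(&computed);

    // Host-to-device copy runs on its own stream.
    cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, h2d);
    cudaEventRecord(uploaded, h2d);

    // The kernel waits for the upload, then runs on the single compute stream.
    cudaStreamWaitEvent(compute, uploaded, 0);
    scale<<<(n + 255) / 256, 256, 0, compute>>>(d, n, 2.0f);
    cudaEventRecord(computed, compute);

    // Device-to-host copy waits for the kernel, then runs on its own stream.
    cudaStreamWaitEvent(d2h, computed, 0);
    cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, d2h);
    cudaStreamSynchronize(d2h);

    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```

With this layout, copies can overlap computation from other work items, while the event dependencies keep each operation's result ordered as if everything ran on one stream.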
Our experiments showed that enabling multi-stream automatically does not bring much performance gain, since most of our kernels are large enough to occupy all the processors in the GPU. Enabling multi-stream would also defeat our current design of aggressively recycling GPU memory.
This is a decision we may revisit in the future. If that happens, TensorFlow will most likely assign different CUDA streams to operations/kernels automatically, without exposing streams to users.