I implemented a dense TensorFlow DNN model (2 hidden layers with elu activation functions, trained on MNIST) as a Python class, so that I could wrap the TF calls inside another library that uses my own optimization routines and tools.
While running some tests on a Tesla K20, I noticed that the GPU was being used at only 4% of its total capacity. So I took a closer look at the device placement log and realized that all the critical operations, such as MatMul, Sum, Add, Mean, etc., were being assigned to the CPU.
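For context, this is roughly how I inspect the placement (a minimal sketch of the session config; the actual setup in my code lives inside the class):

```python
import tensorflow as tf

# log_device_placement prints, for every op in the graph,
# the device (CPU or GPU) it was assigned to.
config = tf.ConfigProto(log_device_placement=True)
sess = tf.Session(config=config)
```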
The first thing that came to mind was that I was using dtype=float64, so I switched to dtype=float32. Although many more operations were then assigned to the GPU, a fair number were still assigned to the CPU, for example Mean, gradient/Mean_grad/Prod, gradient/Mean.
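For concreteness, the dtype is simply a parameter of the graph construction, roughly like this (variable names and shapes here are illustrative, not my actual class):

```python
import tensorflow as tf

dtype = tf.float32  # was tf.float64 before the switch

x = tf.placeholder(dtype, shape=[None, 784])                   # MNIST inputs
W = tf.Variable(tf.truncated_normal([784, 100], dtype=dtype))  # first hidden layer
b = tf.Variable(tf.zeros([100], dtype=dtype))
h = tf.nn.elu(tf.matmul(x, W) + b)
loss = tf.reduce_mean(tf.square(h))  # the Mean op that shows up in the placement log
```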
So here is my first question (I link a working code example at the end):
1) Why is this? I have written other TF models that consist of simple tensor multiplications and reductions, and they run fully on the GPU as long as I use single precision.
And here is the second question:
2) Why does TF assign the graph to different devices depending on the data type? I understand that not all kernels are implemented for the GPU, but I would have thought that things like MatMul could run on the GPU for both single and double precision.
3) Could the fact that the model is wrapped in a Python class have an effect? I do not think so, because, as I said, it did not happen for other models wrapped in a similar way, but those were simpler.
4) What steps can I take to run the model fully on the GPU?
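To illustrate questions 2 and 4, this is the kind of explicit placement I have in mind (a minimal sketch, not my actual code; names and shapes are hypothetical, and with allow_soft_placement=True TF silently moves any op without a GPU kernel back to the CPU):

```python
import tensorflow as tf

dtype = tf.float32

with tf.device('/gpu:0'):                                  # request GPU placement explicitly
    x = tf.placeholder(dtype, shape=[None, 784])
    W = tf.Variable(tf.truncated_normal([784, 10], dtype=dtype))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, W)))      # does this pin Mean to the GPU?

config = tf.ConfigProto(allow_soft_placement=True,         # fall back to CPU where needed
                        log_device_placement=True)
sess = tf.Session(config=config)
sess.run(tf.initialize_all_variables())
```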
Here is a complete working example of my code, extracted from my library:
https://gist.github.com/smcantab/8ecb679150a327738102 .
If you run it and look at the output, you will see how the different parts of the graph have been assigned to different devices. To see how this changes with type and device, change dtype and device inside main() at the end of the example. Note that if I set allow_soft_placement=False, the graph fails to initialize.
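For reference, the flag I am referring to is the session config option (sketch only); with soft placement disabled, an op pinned to a device that has no kernel for its dtype can no longer fall back to the CPU, which is presumably why initialization fails:

```python
# Same session setup as above, but without the CPU fallback.
config = tf.ConfigProto(allow_soft_placement=False, log_device_placement=True)
sess = tf.Session(config=config)
```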
Any word of advice would be truly appreciated.