TensorFlow: critical graph operations assigned to the CPU, not the GPU

I implemented a dense TensorFlow DNN model (2 hidden layers with ELU activations, trained on MNIST) as a Python class, in order to wrap the TF calls inside another library that uses my own optimization routines and tools.

When running some tests on a Tesla K20, I noticed that the GPU was only being used at about 4% of its capacity. So I looked more closely at the device placement log and realized that all the critical operations, such as MatMul, Sum, Add, Mean, etc., were being assigned to the CPU.
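For reference, this is a minimal sketch (not my actual code, just a toy graph with made-up names) of how I enable device placement logging with the TF 1.x API:

```python
import numpy as np
import tensorflow as tf

# Toy graph just to illustrate placement logging; names are not from my gist.
x = tf.placeholder(tf.float32, shape=[None, 784], name="x")
w = tf.Variable(tf.truncated_normal([784, 10], stddev=0.1), name="w")
logits = tf.matmul(x, w, name="logits")

# log_device_placement=True makes TensorFlow print the device each op is
# assigned to (e.g. /cpu:0 or /gpu:0) when the graph is run.
config = tf.ConfigProto(log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(logits, feed_dict={x: np.zeros((1, 784), dtype=np.float32)})
```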

The first thing that came to mind was that I was using dtype=float64, so I switched to dtype=float32. Although many more operations were then assigned to the GPU, a fair number were still assigned to the CPU, for example Mean, gradient/Mean_grad/Prod, gradient/Mean.
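For illustration, this is roughly how I make the precision a parameter so I can compare placements (again a sketch with made-up names, not the code from my gist):

```python
import tensorflow as tf

def dense_elu(x, n_out, dtype=tf.float32, name="layer"):
    """One dense layer with ELU; dtype controls the precision of every op."""
    n_in = int(x.get_shape()[1])
    w = tf.Variable(
        tf.truncated_normal([n_in, n_out], stddev=0.1, dtype=dtype),
        name=name + "_w")
    b = tf.Variable(tf.zeros([n_out], dtype=dtype), name=name + "_b")
    return tf.nn.elu(tf.matmul(x, w) + b)

# float32 graph: most kernels (MatMul, Sum, Add, ...) have GPU implementations.
x32 = tf.placeholder(tf.float32, shape=[None, 784])
h32 = dense_elu(x32, 256, dtype=tf.float32, name="h32")

# float64 graph: at the time, several reduction kernels only existed on the CPU.
x64 = tf.placeholder(tf.float64, shape=[None, 784])
h64 = dense_elu(x64, 256, dtype=tf.float64, name="h64")
```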

So here is my first question (I link a working code example at the end):

1) Why is this? I have written other TF models that consist of simple tensor multiplications and reductions, and they run entirely on the GPU as long as I use the same precision.

Which brings me to the second question:

2) Why does TF assign the graph to different devices depending on the data type? I understand that not all kernels are implemented for the GPU, but I would have thought that things like MatMul would run on the GPU for both single and double precision.

3) Could the fact that the model is wrapped in a Python class have an effect? I don't think so because, as I said, this did not happen for other models wrapped in a similar way, although those were simpler.

4) What steps can I take to run the model entirely on the GPU?

Here is a complete working example of my code, extracted from my library:

https://gist.github.com/smcantab/8ecb679150a327738102 .

If you run it and look at the output, you will see how the different parts of the graph have been assigned to different devices. To see how this changes with type and device, change dtype and device inside main() at the end of the example. Note that if I set allow_soft_placement=False, the graph fails to initialize.
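For context, a minimal sketch of the two knobs involved, tf.device and allow_soft_placement (illustrative only, not the gist itself):

```python
import tensorflow as tf

dtype = tf.float32      # change to tf.float64 to compare placements
device = "/gpu:0"       # change to "/cpu:0" to pin everything to the CPU

with tf.device(device):
    x = tf.placeholder(dtype, shape=[None, 784])
    w = tf.Variable(tf.truncated_normal([784, 10], stddev=0.1, dtype=dtype))
    loss = tf.reduce_mean(tf.square(tf.matmul(x, w)))

# allow_soft_placement=True lets ops that have no GPU kernel fall back to the
# CPU; with False, initialization fails as soon as one pinned op cannot be
# placed on the requested device.
config = tf.ConfigProto(allow_soft_placement=True, log_device_placement=True)
with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
```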

Any advice would be truly appreciated.

2 answers

As Yaroslav noted, Mean in particular had not yet been implemented for the GPU, but it is now available, so these operations run on the GPU with the latest TensorFlow (per the DEVICE_GPU registration at that link).

Before Mean became available, the status was:

(a) You can implement the mean manually, because reduce_sum is available on the GPU (see the sketch after this list).

(b) I've asked someone to look into whether there is an easy way to add GPU support; we'll see.
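A minimal sketch of workaround (a), assuming a float32 input tensor:

```python
import tensorflow as tf

x = tf.placeholder(tf.float32, shape=[None, 10])

# reduce_sum had a GPU kernel before reduce_mean did, so divide the sum by the
# (float-cast) element count instead of calling reduce_mean.
n = tf.cast(tf.size(x), tf.float32)
mean_workaround = tf.reduce_sum(x) / n
```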

Re float64 on the GPU: someone filed an issue three days ago with a patch to support float64 reductions on the GPU; it is currently being reviewed and tested.

No, it doesn't matter that the model is wrapped in a Python class; what matters is only whether a kernel was defined to run the op on the GPU or not. In many cases, the answer to "why is op X not supported on the GPU for type Y?" comes down to whether there has been demand for Y on the GPU. The case of float64 is simpler: float32 is much faster, so in most cases people work to make their models run in float32 whenever possible, since it gives all the speed benefits.


Most consumer graphics cards, such as the GTX 980, 1080, etc., lack double-precision floating-point hardware units. Since they are much cheaper and therefore far more widespread than the newer Tesla cards (which do have FP64 hardware), double-precision computation on these cards is very slow compared to single precision: FP64 on a GPU without FP64 hardware can be roughly 32x slower than FP32 on the same GPU. I believe this is why FP32 computations are usually set up to run on the GPU and FP64 on the CPU (which is faster for double precision on most such systems). Hopefully, frameworks will eventually probe the GPU's capabilities at run time to decide where to place FP64 computations.
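As a rough sketch of how one might act on this today (not tied to the question's actual model), keep the FP32 math on the GPU and pin any part that genuinely needs double precision to the CPU:

```python
import tensorflow as tf

x64 = tf.placeholder(tf.float64, shape=[None, 784])

# Do the heavy math in float32 on the GPU...
with tf.device("/gpu:0"):
    x32 = tf.cast(x64, tf.float32)
    w = tf.Variable(tf.truncated_normal([784, 10], stddev=0.1))
    y32 = tf.matmul(x32, w)

# ...and keep the computation that really needs double precision on the CPU.
with tf.device("/cpu:0"):
    loss64 = tf.reduce_mean(tf.cast(y32, tf.float64))
```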

