Mixed precision not enabled in TF 1.4 on Tesla V100

I was interested in testing my neural network (an autoencoder serving as the generator, plus a CNN as the discriminator), which uses 3D conv/deconv layers, on the new Volta architecture to benefit from mixed-precision training. I compiled the latest TensorFlow 1.4 from source with CUDA 9 and cuDNN 7.0 and cast all the trainable variables used by my conv/deconv layers to tf.float16. In addition, all of my input and output tensors have dimensions that are multiples of 8.
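For reference, here is a minimal sketch of that kind of casting in a TF 1.x graph (simplified, not my exact code: the float32-master-weights getter is the pattern NVIDIA's mixed-precision docs suggest, and the shapes and layer sizes are made up for illustration):

```python
import tensorflow as tf  # TF 1.x graph API

def float32_variable_storage_getter(getter, name, shape=None, dtype=None,
                                    initializer=None, trainable=True,
                                    *args, **kwargs):
    # Store master weights in float32, hand float16 casts to the ops
    storage_dtype = tf.float32 if dtype in [tf.float16, tf.float32] else dtype
    variable = getter(name, shape, dtype=storage_dtype,
                      initializer=initializer, trainable=trainable,
                      *args, **kwargs)
    if dtype == tf.float16:
        variable = tf.cast(variable, tf.float16)
    return variable

# Hypothetical 3D input: batch, depth, height, width, channels (multiples of 8)
volumes = tf.placeholder(tf.float16, [8, 32, 32, 32, 8], name='volumes')

with tf.variable_scope('generator',
                       custom_getter=float32_variable_storage_getter):
    x = tf.layers.conv3d(volumes, filters=16, kernel_size=3,
                         padding='same', activation=tf.nn.relu)
    x = tf.layers.conv3d_transpose(x, filters=8, kernel_size=3, padding='same')
```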

Unfortunately, I do not see a significant speed improvement with this configuration; training time is roughly the same as with tf.float32. My understanding is that on the Volta architecture with cuDNN 7.0, mixed precision should be detected by TF automatically, enabling the use of Tensor Core math. Am I mistaken, or is there something I need to do to turn it on? I also tried building the TF 1.5 nightly, and it seems to be even slower than my custom 1.4 build.

I would appreciate it if any of the developers involved in TensorFlow could answer this question.

EDIT: After talking with NVIDIA technical support, it seems that TF's float16 support includes mixed-precision acceleration for plain 2D convolution ops, but not for 3D ops.

4 answers

Based on the NVIDIA documentation, I ran a benchmark with FP16 (Tensor Cores). For this, I modified the alexnet_benchmark shipped with TensorFlow: https://gist.github.com/melgor/946b9643aa25dd3839a86804fc580741

Overall, AlexNet is 35% faster, which is not that much; I was hoping for ~2x. Maybe ResNet will show a bigger difference. The best part is that I can fit the model with batch_size=5120 (fp32 cannot); one forward-backward pass takes 0.653 sec, so 90 epochs of ImageNet training would take ~4 hours.

batch_size=512
alexnet_fp32: Forward-backward across 100 steps, 0.099 +/- 0.000 sec / batch
alexnet_fp16: Forward-backward across 100 steps, 0.064 +/- 0.000 sec / batch
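The exact changes are in the gist above; the essence is just building the benchmark's variables and inputs in the chosen dtype. An illustrative sketch (not the gist itself; layer parameters here are made up):

```python
import tensorflow as tf

DTYPE = tf.float16  # switch to tf.float32 to reproduce the fp32 numbers

def conv_relu(x, kernel_shape, strides, name):
    # Create filter and bias directly in DTYPE so cuDNN sees half-precision tensors
    with tf.variable_scope(name):
        kernel = tf.get_variable(
            'weights', kernel_shape, dtype=DTYPE,
            initializer=tf.truncated_normal_initializer(stddev=1e-1, dtype=DTYPE))
        biases = tf.get_variable(
            'biases', [kernel_shape[-1]], dtype=DTYPE,
            initializer=tf.constant_initializer(0.0))
        conv = tf.nn.conv2d(x, kernel, strides=strides, padding='SAME')
        return tf.nn.relu(tf.nn.bias_add(conv, biases))

# batch_size=512 as in the numbers above
images = tf.random_uniform([512, 224, 224, 3], dtype=DTYPE)
conv1 = conv_relu(images, [11, 11, 3, 64], [1, 4, 4, 1], 'conv1')
```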

Edit:

I managed to run the ResNet models in FP16 (but without BatchNorm; for some reason BN does not work with fp16):

batch_size=256
resnet50_fp32: Forward-backward across 100 steps, 0.575 +/- 0.001 sec / batch
resnet50_fp16: Forward-backward across 100 steps, 0.504 +/- 0.001 sec / batch

batch_size=128
resnet152_fp32: Forward-backward across 100 steps, 0.757 +/- 0.001 sec / batch
resnet152_fp16: Forward-backward across 100 steps, 0.581 +/- 0.010 sec / batch

The gain for ResNet is even smaller. It looks like FP16 does not give a big win on the V100, and I don't know why. Tensor Core support may simply not be fully integrated yet.
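On the BatchNorm issue mentioned above: a common workaround (not what was done for the numbers here, where BN was simply removed) is to keep batch normalization in float32 and cast around it. A minimal sketch, assuming a TF 1.x graph with tf.layers:

```python
import tensorflow as tf

def batch_norm_fp32(x, training, name='bn'):
    """Run batch norm in float32 even when the surrounding graph is float16."""
    input_dtype = x.dtype
    if input_dtype == tf.float16:
        x = tf.cast(x, tf.float32)
    x = tf.layers.batch_normalization(x, training=training, name=name)
    if input_dtype == tf.float16:
        x = tf.cast(x, input_dtype)  # hand float16 back to the rest of the graph
    return x
```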


I am very interested in this topic. Does anyone have updated information on the current state of Volta Tensor Core integration in TensorFlow? I ran speed experiments with a Volta V100 GPU and TensorFlow 1.5 with CUDA 9.0 and cuDNN, and came to the following conclusions:

  • Training on the Volta V100 is no faster than training on a GeForce 1080 Ti, while it should be significantly faster. Using float16 or float32 doesn't change anything.
  • Training on the Volta V100 with float16 is no faster than training on the Volta V100 with float32. Volta GPUs are supposed to be optimized for float16, so I expected a significant speed improvement.

So basically I reached the same conclusion as the OP: Volta GPUs are not yet fully supported by TensorFlow.

This TensorFlow GitHub PR seems relevant, although I have not tested the changes yet: https://github.com/tensorflow/tensorflow/pull/16253


I believe TensorFlow does not use the correct cuDNN API calls to determine the best algorithms. I searched the TensorFlow code for cudnnGetConvolutionForwardAlgorithm_v7 as well as cudnnFindConvolutionForwardAlgorithmEx and found no matches. I am going to raise a ticket with TensorFlow.


Yesterday, NVIDIA showed off Automatic Mixed Precision, which makes implementing this feature much easier and significantly reduces the effort involved. The webcast seems to have been recorded and will be available on demand, but for now the links are here:

https://developer.nvidia.com/automatic-mixed-precision

https://devblogs.nvidia.com/nvidia-automatic-mixed-precision-tensorflow/
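With a recent enough TF 1.x (1.14+), the graph-rewrite form of AMP is a one-line wrapper around the optimizer; the blog post above also describes an environment-variable switch (TF_ENABLE_AUTO_MIXED_PRECISION=1) for the NVIDIA containers. A minimal sketch (the toy model below is made up for illustration):

```python
import tensorflow as tf  # requires TF 1.14+; the graph itself stays float32

# Toy model: the rewrite decides which ops run in float16
x = tf.random_uniform([256, 1024])
labels = tf.random_uniform([256, 10])
hidden = tf.layers.dense(x, 4096, activation=tf.nn.relu)
logits = tf.layers.dense(hidden, 10)
loss = tf.losses.softmax_cross_entropy(labels, logits)

opt = tf.train.MomentumOptimizer(learning_rate=0.01, momentum=0.9)
# One-line AMP: rewrites eligible ops to float16 and adds automatic loss scaling
opt = tf.train.experimental.enable_mixed_precision_graph_rewrite(opt)
train_op = opt.minimize(loss)
```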

In addition, there is an excellent article that discusses, among other things, the implementation of mixed precision. I prepared a 4-minute video, "AI Supercharging with High Performance Distributed Computing" (http://youtu.be/JvssZESVcjI), which sums it up.

