Based on the NVIDIA documentation, I am running a benchmark with FP16 (Tensor Cores). For this, I modified the alexnet_benchmark shipped with TensorFlow: https://gist.github.com/melgor/946b9643aa25dd3839a86804fc580741
Overall, AlexNet is 35% faster, which is not that much; I was hoping for ~2x. Maybe ResNet will show a bigger difference. The best part is that I can fit the model with batch_size = 5120 (FP32 cannot); one forward-backward pass takes 0.653 s, so training on ImageNet for 90 epochs would take ~4 hours.
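The ~4 hour figure is easy to verify with a back-of-envelope calculation, assuming the standard ImageNet-1k training set of ~1,281,167 images (that count is my assumption, not stated above):

```python
# Back-of-envelope check of the 90-epoch training-time estimate.
# Assumes the standard ImageNet-1k training set size.
images = 1_281_167
batch_size = 5120
sec_per_batch = 0.653   # measured forward-backward time at batch_size=5120
epochs = 90

batches_per_epoch = images / batch_size                        # ~250 batches
total_hours = batches_per_epoch * sec_per_batch * epochs / 3600
print(f"{total_hours:.1f} hours")                              # ~4.1 hours
```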
batch_size=512
alexnet_fp32: Forward-backward across 100 steps, 0.099 +/- 0.000 sec / batch
alexnet_fp16: Forward-backward across 100 steps, 0.064 +/- 0.000 sec / batch
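The "Forward-backward across N steps, mean +/- std sec / batch" lines come from the benchmark's timing loop. A minimal standalone sketch of that harness (a pure-Python stand-in; `step_fn` would be the actual `session.run(train_op)` call in the TF benchmark):

```python
import time
import statistics

def time_run(step_fn, num_steps=100, burn_in=10):
    """Time step_fn over num_steps iterations after a warm-up,
    reporting mean and standard deviation in seconds per batch."""
    for _ in range(burn_in):          # warm-up steps excluded from timing
        step_fn()
    durations = []
    for _ in range(num_steps):
        start = time.time()
        step_fn()
        durations.append(time.time() - start)
    mean = statistics.mean(durations)
    std = statistics.pstdev(durations)
    print(f"Forward-backward across {num_steps} steps, "
          f"{mean:.3f} +/- {std:.3f} sec / batch")
    return mean, std

# usage (hypothetical): time_run(lambda: sess.run(train_op))
```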
Edit:
I managed to run the ResNet models in FP16 (but without BatchNorm; for some reason BN does not work with FP16):
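One plausible reason BatchNorm breaks in FP16 (an assumption on my part, not confirmed above) is the narrow FP16 range: the maximum representable value is 65504, so the squared terms in the variance computation can overflow to inf. A tiny NumPy illustration of the numeric issue, independent of TensorFlow:

```python
import numpy as np

# FP16 overflows at 65504, so squaring an activation of magnitude ~300
# (as a variance computation would) already produces inf.
x = np.float16(300.0)
print(x * x)                          # inf in fp16: 90000 > 65504

# The same computation is fine if intermediates are kept in fp32,
# which is why BN is usually run in fp32 even in mixed-precision setups.
x32 = np.float32(300.0)
print(x32 * x32)                      # 90000.0
```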
batch_size=256
resnet50_fp32: Forward-backward across 100 steps, 0.575 +/- 0.001 sec / batch
resnet50_fp16: Forward-backward across 100 steps, 0.504 +/- 0.001 sec / batch
batch_size=128
resnet152_fp32: Forward-backward across 100 steps, 0.757 +/- 0.001 sec / batch
resnet152_fp16: Forward-backward across 100 steps, 0.581 +/- 0.010 sec / batch
The gain for ResNet is even smaller. It looks like FP16 does not give a big win on the V100, and I don't know why. Tensor Core support may not be fully integrated in TensorFlow yet.