Tensorflow GPU usage only 60% (GTX 1070)

I am training a CNN model with TensorFlow, but GPU utilization sits at only about 60% (±2-3%), without big drops.

Sun Oct 23 11:34:26 2016
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 367.57                 Driver Version: 367.57                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 1070    Off  | 0000:01:00.0     Off |                  N/A |
|  1%   53C    P2    90W / 170W |  7823MiB /  8113MiB  |     60%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|    0      3644    C   /usr/bin/python2.7                            7821MiB |
+-----------------------------------------------------------------------------+

Since this is a Pascal card, I am using CUDA 8 with cuDNN 5.1.5. CPU usage is around 50%, evenly distributed across 8 threads (i7 4770K), so the processor should not be the bottleneck.

I use the TensorFlow binary file format (TFRecords) and read it with tf.TFRecordReader().

I create batches of images as follows:

 # Uses tf.TFRecordReader() to read a single Example
 label, image = read_and_decode_single_example(filename_queue=filename_queue)
 image = tf.image.decode_jpeg(image.values[0], channels=3)
 jpeg = tf.cast(image, tf.float32) / 255.
 jpeg.set_shape([66, 200, 3])
 images_batch, labels_batch = tf.train.shuffle_batch(
     [jpeg, label],
     batch_size=FLAGS.batch_size,
     num_threads=8,
     capacity=2000,           # tried bigger values here, does not change the performance
     min_after_dequeue=1000)  # here too

Here is my training loop:

 sess = tf.Session()
 sess.run(init)
 tf.train.start_queue_runners(sess=sess)

 for step in xrange(FLAGS.max_steps):
     labels, images = sess.run([labels_batch, images_batch])
     feed_dict = {images_placeholder: images, labels_placeholder: labels}
     _, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)

I don't have much experience with TensorFlow, and I don't know where the bottleneck could be. If you need any additional code snippets to help identify the problem, I will provide them.

UPDATE: bandwidth test results

 ==5172== NVPROF is profiling process 5172, command: ./bandwidthtest

 Device: GeForce GTX 1070
 Transfer size (MB): 3960

 Pageable transfers
   Host to Device bandwidth (GB/s): 7.066359
   Device to Host bandwidth (GB/s): 6.850315

 Pinned transfers
   Host to Device bandwidth (GB/s): 12.038037
   Device to Host bandwidth (GB/s): 12.683915

 ==5172== Profiling application: ./bandwidthtest
 ==5172== Profiling result:
 Time(%)      Time     Calls       Avg       Min       Max  Name
  50.03%  933.34ms         2  466.67ms  327.33ms  606.01ms  [CUDA memcpy DtoH]
  49.97%  932.32ms         2  466.16ms  344.89ms  587.42ms  [CUDA memcpy HtoD]

 ==5172== API calls:
 Time(%)      Time     Calls       Avg       Min       Max  Name
  46.60%  1.86597s         4  466.49ms  327.36ms  606.15ms  cudaMemcpy
  35.43%  1.41863s         2  709.31ms  632.94ms  785.69ms  cudaMallocHost
  17.89%  716.33ms         2  358.17ms  346.14ms  370.19ms  cudaFreeHost
   0.04%  1.5572ms         1  1.5572ms  1.5572ms  1.5572ms  cudaMalloc
   0.02%  708.41us         1  708.41us  708.41us  708.41us  cudaFree
   0.01%  203.58us         1  203.58us  203.58us  203.58us  cudaGetDeviceProperties
   0.00%  187.55us         1  187.55us  187.55us  187.55us  cuDeviceTotalMem
   0.00%  162.41us        91  1.7840us     105ns  61.874us  cuDeviceGetAttribute
   0.00%  79.979us         4  19.994us  1.9580us  73.537us  cudaEventSynchronize
   0.00%  77.074us         8  9.6340us  1.5860us  28.925us  cudaEventRecord
   0.00%  19.282us         1  19.282us  19.282us  19.282us  cuDeviceGetName
   0.00%  17.891us         4  4.4720us     629ns  8.6080us  cudaEventDestroy
   0.00%  16.348us         4  4.0870us     818ns  8.8600us  cudaEventCreate
   0.00%  7.3070us         4  1.8260us  1.7040us  2.0680us  cudaEventElapsedTime
   0.00%  1.6670us         3     555ns     128ns  1.2720us  cuDeviceGetCount
   0.00%  813ns            3     271ns     142ns     439ns  cuDeviceGet
2 answers

After gaining more experience with TensorFlow, I realized that GPU utilization depends heavily on the network size, the batch size, and the preprocessing. Using a larger network with more conv layers (for example, ResNet-style) increases GPU utilization, because more computation is done per step and the cost of transferring the data becomes relatively small compared to the computation.
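As a rough illustration of the batch-size point (my own sketch, not part of the original answer), you can time a few steps of a throwaway conv network on random data at different batch sizes and watch nvidia-smi while it runs. The helper names and layer sizes below are made up, and it assumes a TensorFlow 1.x-style API:

 # Hypothetical benchmark: more work per step usually means higher GPU utilization
 # and higher images/sec, until memory or the input pipeline becomes the limit.
 import time
 import tensorflow as tf

 def conv_relu(x, out_channels):
     # plain 3x3 conv + ReLU with freshly created weights (illustrative only)
     in_channels = x.get_shape().as_list()[-1]
     w = tf.Variable(tf.truncated_normal([3, 3, in_channels, out_channels], stddev=0.1))
     return tf.nn.relu(tf.nn.conv2d(x, w, strides=[1, 1, 1, 1], padding='SAME'))

 def images_per_sec(batch_size, num_layers=4, steps=50):
     tf.reset_default_graph()
     images = tf.random_normal([batch_size, 66, 200, 3])  # same image shape as in the question
     net = images
     for _ in range(num_layers):
         net = conv_relu(net, 32)
     loss = tf.reduce_mean(net)
     train_op = tf.train.GradientDescentOptimizer(0.01).minimize(loss)
     with tf.Session() as sess:
         sess.run(tf.global_variables_initializer())
         sess.run(train_op)  # warm-up step, excluded from timing
         start = time.time()
         for _ in range(steps):
             sess.run(train_op)
         return steps * batch_size / (time.time() - start)

 for bs in (16, 64, 128):
     print('batch_size=%d: %.0f images/sec' % (bs, images_per_sec(bs)))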


One potential bottleneck is the PCI Express bus between the CPU and the GPU when the images are transferred to the GPU. You can use some tools to measure it.
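To check whether this is what limits the GPU here, one rough approach (my addition, reusing the variable names from the training loop in the question) is to time the batch fetch and the feed_dict training step separately:

 # Rough timing sketch: assumes the session and tensors from the question's
 # training loop already exist. Splits each step into "fetch batch on the host"
 # and "feed_dict + graph execution" (which includes the host-to-device copy).
 import time

 fetch_secs, train_secs, steps = 0.0, 0.0, 200
 for step in xrange(steps):
     t0 = time.time()
     labels, images = sess.run([labels_batch, images_batch])
     t1 = time.time()
     feed_dict = {images_placeholder: images, labels_placeholder: labels}
     _, loss_value = sess.run([train_op, loss], feed_dict=feed_dict)
     t2 = time.time()
     fetch_secs += t1 - t0
     train_secs += t2 - t1

 print('avg batch fetch: %.1f ms, avg train step: %.1f ms'
       % (1000.0 * fetch_secs / steps, 1000.0 * train_secs / steps))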

Another potential bottleneck is disk I/O. I don't see anything in your code that should cause it, but it is always worth keeping an eye on it.
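For example (my addition, not from the answer, and assuming the psutil package is available), you could sample disk read throughput from a separate Python shell while training runs:

 # Sample system-wide disk I/O counters over a 10-second window while the
 # training loop is running in another process; a high sustained read rate
 # would point to disk I/O as a possible bottleneck.
 import time
 import psutil

 window = 10.0
 before = psutil.disk_io_counters()
 time.sleep(window)
 after = psutil.disk_io_counters()
 mb_read = (after.read_bytes - before.read_bytes) / (1024.0 * 1024.0)
 print('read %.1f MB in %.0f s (%.1f MB/s)' % (mb_read, window, mb_read / window))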


Source: https://habr.com/ru/post/1258659/

