HowTo Benchmark: Reading Data

I use TensorFlow 0.10 and compared the data-reading examples from the official how-tos:

  • Feeding : examples/tutorials/mnist/fully_connected_feed.py
  • Reading from files : examples/how_tos/reading_data/convert_to_records.py and examples/how_tos/reading_data/fully_connected_reader.py
  • Preloaded data (constant) : examples/how_tos/reading_data/fully_connected_preloaded.py
  • Preloaded data (variable) : examples/how_tos/reading_data/fully_connected_preloaded_var.py

I ran these scripts unmodified, except for the last two, which break - at least in version 0.10 - unless I add an extra sess.run(tf.initialize_local_variables()) .
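
For reference, here is a minimal sketch (my own, not taken from the scripts) of where that extra call fits into the session setup, using the TF 0.10-era initializer names:

    import tensorflow as tf

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    # Extra call: input producers created with num_epochs keep their epoch
    # counter in a local variable, which also needs explicit initialization.
    sess.run(tf.initialize_local_variables())
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)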

Main question

Runtimes for 100 mini-batches of 100 examples each, running on a GTX 1060:

  • Feeding : ~0.001 s
  • Reading from files : ~0.010 s
  • Preloaded data (constant) : ~0.010 s
  • Preloaded data (variable) : ~0.010 s

These results surprise me. I would expect Feeding to be the slowest, since it does almost everything in Python, while the other methods use lower-level TensorFlow/C++ to perform similar operations. It is the exact opposite of what I expected. Does anyone understand what is going on?

Secondary question

I have access to another machine with a Titan X and older NVidia drivers. The relative results were roughly in line with the above, except for Preloaded data (constant) , which was disastrously slow, taking many seconds per mini-batch.

Is it known for performance to vary this much across hardware/drivers?

+6
3 answers

Update Oct 9: the slowness happens because the computation is too fast for Python to pre-empt the computation thread and schedule the prefetching threads. Computation in the main thread takes 2 ms, and apparently that is too little for a prefetching thread to grab the GIL. The prefetching threads have larger latency and hence can always be pre-empted by the computation thread. So the computation thread runs through all of the examples, and then spends most of its time blocked on the GIL while some prefetching thread gets scheduled and enqueues a single example. The solution is to increase the number of Python threads, increase the queue size to fit the entire dataset, start the queue runners, and then pause the main thread for a couple of seconds to let the queue runners pre-populate the queue.
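
As an illustration, here is a rough sketch of that fix for a slice_input_producer-style pipeline like the preloaded examples; the data arrays, sizes, and batch parameters are stand-ins of my own:

    import time
    import numpy as np
    import tensorflow as tf

    DATASET_SIZE = 55000                      # e.g. the MNIST training set
    train_images = np.zeros((DATASET_SIZE, 784), dtype=np.float32)  # stand-in data
    train_labels = np.zeros((DATASET_SIZE,), dtype=np.int32)

    input_images = tf.constant(train_images)
    input_labels = tf.constant(train_labels)

    image, label = tf.train.slice_input_producer(
        [input_images, input_labels],
        capacity=DATASET_SIZE)                # queue big enough for the whole dataset
    images, labels = tf.train.batch(
        [image, label], batch_size=100,
        num_threads=5,                        # more Python enqueue threads
        capacity=DATASET_SIZE)

    sess = tf.Session()
    sess.run(tf.initialize_all_variables())
    sess.run(tf.initialize_local_variables())
    tf.train.start_queue_runners(sess=sess)
    time.sleep(10)                            # give the queue runners time to pre-fill
    # ... training loop using `images` and `labels` goes here ...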

Old stuff

This is surprisingly slow.

This looks like a case where the last three examples are unnecessarily slow (most of the optimization effort went into large models such as ImageNet, so MNIST got less attention).

You can diagnose the problem by collecting timelines, as described here.
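
For concreteness, here is a minimal sketch of collecting such a timeline for a single step; the toy graph is mine, and in the real scripts you would pass the training op instead:

    import tensorflow as tf
    from tensorflow.python.client import timeline

    x = tf.random_normal([1000, 1000])
    y = tf.matmul(x, x)                       # toy op standing in for the train step

    run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
    run_metadata = tf.RunMetadata()

    with tf.Session() as sess:
        sess.run(y, options=run_options, run_metadata=run_metadata)

    # Write a Chrome trace; open chrome://tracing and load the file to inspect it.
    trace = timeline.Timeline(step_stats=run_metadata.step_stats)
    with open('timeline.json', 'w') as f:
        f.write(trace.generate_chrome_trace_format())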

Here are timelines from these examples with timeline collection enabled.

Here's the timeline of the feed_dict implementation

[timeline screenshot]

It is important to note that matmul takes a good chunk of time, so the reading overhead is negligible.

Now here is the timeline for the file reader:

[timeline screenshot]

You can see that the operation is bottlenecked on QueueDequeueMany, which takes a whopping 45 ms.

If you zoom in, you will see a bunch of tiny MEMCPY and Cast operations, which is a sign that some op is CPU-only ( parse_single_example ), and the dequeue has to schedule several independent CPU->GPU transfers.

In the var example below, with the GPU disabled, I don't see the tiny operations, but QueueDequeueMany still takes more than 10 ms. The time seems to scale linearly with batch size, so there is some fundamental slowness there. Filed #4740.

[timeline screenshot]

+5

Yaroslav nails the problem well. With small models you will need to speed up the data import. One way to do this is with the TensorFlow function tf.TFRecordReader.read_up_to , which reads multiple records in each session.run() call, thereby removing the excess overhead caused by multiple calls.

    enqueue_many_size = SOME_ENQUEUE_MANY_SIZE
    reader = tf.TFRecordReader(
        options=tf.python_io.TFRecordOptions(tf.python_io.TFRecordCompressionType.ZLIB))
    _, queue_batch = reader.read_up_to(filename_queue, enqueue_many_size)
    batch_serialized_example = tf.train.shuffle_batch(
        [queue_batch],
        batch_size=batch_size,
        num_threads=thread_number,
        capacity=capacity,
        min_after_dequeue=min_after_dequeue,
        enqueue_many=True)
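
For context, the filename queue feeding read_up_to above might be built as sketched below, and the batched serialized records can then be parsed with a single tf.parse_example instead of per-record parse_single_example. The file name and feature spec are illustrative (loosely following the MNIST TFRecord example), and the snippet continues from the code above:

    filename_queue = tf.train.string_input_producer(['train.tfrecords'])

    # Parse the whole batch of serialized Example protos in one op.
    features = tf.parse_example(
        batch_serialized_example,
        features={'image_raw': tf.FixedLenFeature([], tf.string),
                  'label': tf.FixedLenFeature([], tf.int64)})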

It has also been covered in this question: fooobar.com/questions/844123/...

+2

The question addressed here: why is the preloaded data (constant) example, examples/how_tos/reading_data/fully_connected_preloaded.py , so much slower than the other data-loading examples when using the GPU?

I had the same problem: fully_connected_preloaded.py was unexpectedly slow on my Titan X. The problem was that the entire dataset was preloaded onto the CPU, not the GPU.

First, let me share my initial attempts. I applied the following recommendations from Yaroslav:

  • set capacity=55000 for tf.train.slice_input_producer (55000 is the size of the MNIST training set in my case).
  • set num_threads=5 for tf.train.batch .
  • set capacity=500 for tf.train.batch .
  • insert time.sleep(10) after tf.train.start_queue_runners .

However, the average time per batch remained the same. I tried timeline visualization for profiling, and QueueDequeueManyV2 was still dominant.

The problem was line 65 of fully_connected_preloaded.py . The following code preloads the entire dataset onto the CPU, which still leaves CPU-to-GPU data transfer as the bottleneck.

    with tf.device('/cpu:0'):
        input_images = tf.constant(data_sets.train.images)
        input_labels = tf.constant(data_sets.train.labels)

Therefore, I changed the device placement.

    with tf.device('/gpu:0'):

Then I got a ~100x speedup per batch.
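
For completeness, a minimal sketch of the changed placement; the allow_soft_placement config is my own addition so that any input ops without GPU kernels can still fall back to the CPU:

    with tf.device('/gpu:0'):
        input_images = tf.constant(data_sets.train.images)
        input_labels = tf.constant(data_sets.train.labels)

    # Let ops that lack a GPU kernel be placed on the CPU automatically.
    sess = tf.Session(config=tf.ConfigProto(allow_soft_placement=True))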

Note:

  • This was possible because the Titan X has enough memory to preload the entire data set.
  • In the original code ( fully_connected_preloaded.py ), the comment on line 64 says "rest of pipeline is CPU-only". I'm not sure what that intends.
+1

Source: https://habr.com/ru/post/1264324/