TensorFlow Tutorial: Duplicate Shuffling in the Input Pipeline

The TensorFlow Reading Data tutorial provides an example input pipeline. In this pipeline, the data is shuffled twice: inside string_input_producer, as well as inside the shuffle batch generator. Here is the code:

    def input_pipeline(filenames, batch_size, num_epochs=None):
        # First shuffle in the input pipeline
        filename_queue = tf.train.string_input_producer(
            filenames, num_epochs=num_epochs, shuffle=True)
        example, label = read_my_file_format(filename_queue)
        min_after_dequeue = 10000
        capacity = min_after_dequeue + 3 * batch_size
        # Second shuffle as part of the batching.
        # Requires min_after_dequeue preloaded examples.
        example_batch, label_batch = tf.train.shuffle_batch(
            [example, label], batch_size=batch_size, capacity=capacity,
            min_after_dequeue=min_after_dequeue)
        return example_batch, label_batch

Does the second shuffle serve any useful purpose? A drawback of the shuffle batch generator is that min_after_dequeue examples must always be kept preloaded in memory to provide a useful shuffle. My image data is quite heavy in memory consumption, which is why I am considering using the plain batch generator instead. Is there any advantage to shuffling the data twice?
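
For concreteness, here is a sketch of the variant I have in mind (reusing read_my_file_format from the tutorial code above): the file-level shuffle is kept, but batching uses tf.train.batch, so only capacity examples sit in the queue instead of min_after_dequeue + 3 * batch_size.

    import tensorflow as tf

    def input_pipeline_single_shuffle(filenames, batch_size, num_epochs=None):
        # Keep the file-level shuffle from string_input_producer...
        filename_queue = tf.train.string_input_producer(
            filenames, num_epochs=num_epochs, shuffle=True)
        example, label = read_my_file_format(filename_queue)
        # ...but batch without a second shuffle. tf.train.batch keeps at
        # most `capacity` examples queued, far fewer than the 10000+
        # required by shuffle_batch's min_after_dequeue.
        example_batch, label_batch = tf.train.batch(
            [example, label], batch_size=batch_size,
            capacity=3 * batch_size)
        return example_batch, label_batch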

Edit: An additional question: why is string_input_producer initialized with a default capacity of only 32? Wouldn't it be beneficial to have a capacity of several multiples of batch_size?

+5
3 answers

Yes, this is a common pattern, and the tutorial shows it in its most general form. string_input_producer shuffles the order in which the data files are read. Each data file usually contains many examples, for efficiency. (Reading a million small files is very slow; it is better to read 1000 larger files with 1000 examples each.)

Examples from those files are then read into the shuffling queue, where they are shuffled at a much finer granularity, so that examples from the same file are not always trained on in the same order and examples from different input files get mixed together.

For more details, see Getting good mixing with many input datafiles in tensorflow.

If each of your files contains only a single input example, you do not need to shuffle twice and can get by with just string_input_producer. But note that it is still useful to have a queue holding several images after reading, so that you can overlap input I/O with the training of your network. The queue_runner for batch or shuffle_batch runs in a separate thread, ensuring that I/O happens in the background and that images are always available for training; this, of course, matters for how quickly mini-batches can be produced for training.
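
To make the background-I/O point concrete, here is the usual coordinator / queue-runner driver loop (a sketch: input_pipeline is the question's function, and the file list is a placeholder):

    import tensorflow as tf

    filenames = ['file1', 'file2']  # placeholder list of data files
    example_batch, label_batch = input_pipeline(filenames, batch_size=32,
                                                num_epochs=5)

    with tf.Session() as sess:
        sess.run(tf.global_variables_initializer())
        sess.run(tf.local_variables_initializer())  # num_epochs counter
        coord = tf.train.Coordinator()
        # Launches the QueueRunner threads registered by
        # string_input_producer and shuffle_batch, so file reading fills
        # the queues in the background while the main thread trains.
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        try:
            while not coord.should_stop():
                # Stand-in for a training step; fetching the batch is
                # enough to exercise the pipeline.
                sess.run([example_batch, label_batch])
        except tf.errors.OutOfRangeError:
            pass  # input exhausted after num_epochs
        finally:
            coord.request_stop()
            coord.join(threads)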

+6

Both shuffles serve different purposes and shuffle different things:

  • tf.train.string_input_producer — shuffle: Boolean. If true, the strings (file names) are randomly shuffled within each epoch. So if you have several files ['file1', 'file2', ..., 'filen'], a file is picked at random from this list each time. If shuffle is false, the files follow one after another (see the toy demo after this list).
  • tf.train.shuffle_batch — creates batches by randomly shuffling tensors. So it takes batch_size tensors from your read_my_file_format queue and shuffles them.
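
A toy demonstration of the file-level shuffle on its own (the file names are hypothetical; nothing is actually opened, we only dequeue the names):

    import tensorflow as tf

    filenames = ['file1', 'file2', 'file3']
    queue = tf.train.string_input_producer(filenames, num_epochs=2,
                                           shuffle=True)
    name = queue.dequeue()

    with tf.Session() as sess:
        sess.run(tf.local_variables_initializer())  # num_epochs counter
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        try:
            while True:
                # Prints the files in a different random order per epoch,
                # e.g. file2, file3, file1, file1, file3, file2.
                print(sess.run(name))
        except tf.errors.OutOfRangeError:
            pass  # both epochs consumed
        finally:
            coord.request_stop()
            coord.join(threads)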

Since the two shuffles do different things, there is an advantage to shuffling the data twice. As for memory: even with a batch of 256 images, each 256x256 pixels in size, you consume less than 100 MB of memory. If you do run into memory problems at some point, try reducing the batch size.
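
A quick arithmetic check of that figure (my assumption: uint8 RGB images, i.e. one byte per channel; decoded float32 tensors would be four times larger):

    batch_size = 256
    bytes_per_image = 256 * 256 * 3              # height x width x channels
    print(batch_size * bytes_per_image / 2**20)  # ~48 MB, under 100 MB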

Regarding the default capacity: it is model specific. It makes sense to make it larger than batch_size and to ensure the queue never runs empty during training.

0

To answer the additional question: string_input_producer returns a queue containing the names of the files that hold the samples, not the samples themselves. These file names are then consumed downstream by the reader that feeds shuffle_batch. So the number of samples loaded into memory is tied to the capacity parameter of the shuffle_batch function, not to that of string_input_producer.
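
A sketch of that distinction (parameter values are illustrative; read_my_file_format is the question's helper): the filename queue holds short strings, so its default capacity of 32 costs almost nothing, while the shuffle_batch queue holds decoded examples, which is where the memory actually goes.

    import tensorflow as tf

    filenames = ['file1', 'file2']  # placeholder list of data files

    # Queue of file *names*: 32 short strings by default, cheap.
    filename_queue = tf.train.string_input_producer(
        filenames, shuffle=True, capacity=32)

    example, label = read_my_file_format(filename_queue)

    # Queue of decoded *examples*: up to min_after_dequeue + 3 * batch_size
    # tensors live in memory here.
    example_batch, label_batch = tf.train.shuffle_batch(
        [example, label], batch_size=32,
        capacity=10000 + 3 * 32, min_after_dequeue=10000)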

0

Source: https://habr.com/ru/post/1238714/

