What is the correct pattern for reading large text data in batches with TensorFlow?
Here is what one line of the text data looks like; there are billions of such lines in a single txt file:
target context label
Now I am trying to use TFRecords, as recommended in the official documentation.
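For context, this is roughly how I convert each text line into a TFRecord example (the file names and the helper _int64_feature are just placeholders for my setup; I assume all three fields are integer IDs):

import tensorflow as tf

def _int64_feature(value):
    # Wrap a single integer as an int64 Feature.
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

with tf.python_io.TFRecordWriter('train.tfrecords') as writer:
    with open('train.txt') as f:
        for line in f:
            target, context, label = map(int, line.split())
            example = tf.train.Example(features=tf.train.Features(feature={
                'target': _int64_feature(target),
                'context': _int64_feature(context),
                'label': _int64_feature(label),
            }))
            writer.write(example.SerializeToString())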
And here is how I read the records back:
filename_queue = tf.train.string_input_producer(
    [self._train_data], num_epochs=self._num_epochs)
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)

# Parse one serialized Example into its three int64 fields.
features = tf.parse_single_example(
    serialized_example,
    features={
        'target': tf.FixedLenFeature([], tf.int64),
        'context': tf.FixedLenFeature([], tf.int64),
        'label': tf.FixedLenFeature([], tf.int64),
    })
target = features['target']
context = features['context']
label = features['label']

# Shuffle and batch the parsed examples.
min_after_dequeue = 10000
capacity = min_after_dequeue + 3 * self._batch_size
target_batch, context_batch, label_batch = tf.train.shuffle_batch(
    [target, context, label], batch_size=self._batch_size, capacity=capacity,
    min_after_dequeue=min_after_dequeue, num_threads=self._concurrent_steps)
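For completeness, this is roughly how I consume the batches in the session (the loop body is simplified; in the real code it runs the training op on the batch tensors):

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(tf.local_variables_initializer())  # needed for the num_epochs counter

    # Start the queue-runner threads that fill the filename and example queues.
    coord = tf.train.Coordinator()
    threads = tf.train.start_queue_runners(sess=sess, coord=coord)
    try:
        while not coord.should_stop():
            sess.run([target_batch, context_batch, label_batch])
    except tf.errors.OutOfRangeError:
        pass  # reached num_epochs
    finally:
        coord.request_stop()
        coord.join(threads)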
After that, I profiled the graph. The result shows that this input part takes up most of the time; here is the profiling chart.
[profiling result]
By the way, I use a batch size of 500. Any suggestions?