TensorFlow: what is the right way to read data from one large txt file?

I want to ask: what is the correct pattern for reading large text data in batches with TensorFlow?

Each line of text data looks like this, and there are billions of such lines in a single txt file:

    target context label

Now I am trying to use TFRecords, as recommended in the official documentation.
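
For reference, here is a minimal sketch of how such lines could be converted into a TFRecord file. The `write_tfrecords` helper is hypothetical, and the integer-valued fields are an assumption based on the `tf.int64` features parsed below:

    import tensorflow as tf

    def write_tfrecords(txt_path, tfrecord_path):
        # Hypothetical helper: each input line is assumed to hold three
        # whitespace-separated integers: "target context label".
        with tf.python_io.TFRecordWriter(tfrecord_path) as writer:
            for line in open(txt_path):
                target, context, label = (int(v) for v in line.split())
                example = tf.train.Example(features=tf.train.Features(feature={
                    'target': tf.train.Feature(int64_list=tf.train.Int64List(value=[target])),
                    'context': tf.train.Feature(int64_list=tf.train.Int64List(value=[context])),
                    'label': tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
                }))
                writer.write(example.SerializeToString())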

Here is my approach:

    filename_queue = tf.train.string_input_producer([self._train_data], num_epochs=self._num_epochs)

    reader = tf.TFRecordReader()

    # Read and parse one serialized record at a time.
    _, serialized_example = reader.read(filename_queue)

    features = tf.parse_single_example(
        serialized_example,
        # Defaults are not specified since all keys are required.
        features={
            'target': tf.FixedLenFeature([], tf.int64),
            'context': tf.FixedLenFeature([], tf.int64),
            'label': tf.FixedLenFeature([], tf.int64),
        })

    target = features['target']
    context = features['context']
    label = features['label']
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * self._batch_size
    target_batch, context_batch, label_batch = tf.train.shuffle_batch(
        [target, context, label], batch_size=self._batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue, num_threads=self._concurrent_steps)

After that, I profiled the graph. The result shows that this input-reading part takes up most of the time. Here is the profiling chart: [profiling result]

By the way, I use a batch size of 500. Any suggestions?

1 answer

tf.parse_example() is often much more efficient than tf.parse_single_example(), because the former is vectorized and parses a whole batch of records in a single op, while the latter runs once per record. You can restructure your code as follows:

    filename_queue = tf.train.string_input_producer([self._train_data], num_epochs=self._num_epochs)

    reader = tf.TFRecordReader()

    # Read a batch of up to 128 examples at once.
    _, serialized_examples = reader.read_up_to(filename_queue, 128)

    # Parse all of the read records in a single vectorized op.
    features = tf.parse_example(
        serialized_examples,
        # Defaults are not specified since all keys are required.
        features={
            'target': tf.FixedLenFeature([], tf.int64),
            'context': tf.FixedLenFeature([], tf.int64),
            'label': tf.FixedLenFeature([], tf.int64),
        })

    target = features['target']
    context = features['context']
    label = features['label']
    min_after_dequeue = 10000
    capacity = min_after_dequeue + 3 * self._batch_size

    # Pass `enqueue_many=True` because the input is now a batch of parsed examples.
    target_batch, context_batch, label_batch = tf.train.shuffle_batch(
        [target, context, label], batch_size=self._batch_size, capacity=capacity,
        min_after_dequeue=min_after_dequeue, num_threads=self._concurrent_steps,
        enqueue_many=True)
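
To actually pull batches out of either pipeline, the queue runners must be started first. A minimal sketch of the driving loop, assuming the graph above, looks like this:

    with tf.Session() as sess:
        # The local variables initializer is needed for the num_epochs
        # counter inside string_input_producer.
        sess.run([tf.global_variables_initializer(),
                  tf.local_variables_initializer()])
        coord = tf.train.Coordinator()
        threads = tf.train.start_queue_runners(sess=sess, coord=coord)
        try:
            while not coord.should_stop():
                targets, contexts, labels = sess.run(
                    [target_batch, context_batch, label_batch])
                # ... run a training step on this batch ...
        except tf.errors.OutOfRangeError:
            pass  # raised once num_epochs passes over the data are exhausted
        finally:
            coord.request_stop()
            coord.join(threads)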

Source: https://habr.com/ru/post/1673923/

