How to process a large amount of data in TensorFlow?

For my project I have a large amount of data, about 60 GB spread across npy files of roughly 1 GB each, and each file contains about 750 thousand records with their labels.

Each record is 345 float32 values, and each label is 5 float32 values.

I have read the TensorFlow dataset documentation and the queues/threads documentation, but I cannot figure out the best way to handle the input for training, and then how to save the model and weights for later prediction.

My model is pretty straightforward and looks like this:

x = tf.placeholder(tf.float32, [None, 345], name='x')
y = tf.placeholder(tf.float32, [None, 5], name='y')

wi, bi = weight_and_bias(345, 2048)
hidden_fc = tf.nn.sigmoid(tf.matmul(x, wi) + bi)

wo, bo = weight_and_bias(2048, 5)
out_fc = tf.nn.sigmoid(tf.matmul(hidden_fc, wo) + bo)

loss = tf.reduce_mean(tf.squared_difference(y, out_fc))
train_op = tf.train.AdamOptimizer().minimize(loss)

While training my neural network, I read the files one at a time in random order, then used a shuffled numpy index array over each file and manually built each batch to feed to train_op with feed_dict . From everything I have read this is very inefficient, and I should replace it with the Dataset API or queues and threads, but, as I said, the documentation did not help.
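
Concretely, the loop I use now looks roughly like the sketch below (simplified; num_epochs, file_names, batch_size and sess stand in for my real setup):

batch_size = 100
for epoch in range(num_epochs):
    # visit the npy files in a new random order each epoch
    for fname in np.random.permutation(file_names):
        with open(fname, 'rb') as fp:
            data = np.load(fp)      # ~750k x 345 float32 records
            labels = np.load(fp)    # ~750k x 5 float32 labels
        # shuffle an index array and slice it into batches
        idx = np.random.permutation(len(data))
        for start in range(0, len(idx), batch_size):
            batch = idx[start:start + batch_size]
            sess.run(train_op, feed_dict={x: data[batch], y: labels[batch]})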

So what is the best way to handle large amounts of data in TensorFlow?

Also, for reference, my data was saved to each numpy file in two steps:

with open('datafile1.npy', 'wb') as fp:
    np.save(fp, data)
    np.save(fp, labels)
1 answer

The npy file utilities do indeed allocate the whole array in memory. I would recommend converting all of your numpy arrays to the TFRecord format and using those files for training. This is one of the most efficient ways to read a large dataset in TensorFlow.

Convert to TFRecords

def array_to_tfrecords(X, y, output_file):
    writer = tf.python_io.TFRecordWriter(output_file)
    # write one Example per record, so each one matches the per-record parser below
    for record, label in zip(X, y):
        feature = {
            'X': tf.train.Feature(float_list=tf.train.FloatList(value=record.flatten())),
            'y': tf.train.Feature(float_list=tf.train.FloatList(value=label.flatten()))
        }
        example = tf.train.Example(features=tf.train.Features(feature=feature))
        writer.write(example.SerializeToString())
    writer.close()

A complete example that deals with images can be found here.
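
For your data layout, one way to drive this conversion over the existing npy files could look like the sketch below (the datafileN.npy / datafileN.tfrecord names and num_files are placeholders for your actual files):

import numpy as np

for i in range(1, num_files + 1):
    # each npy file holds the records and the labels saved back-to-back
    with open('datafile%d.npy' % i, 'rb') as fp:
        data = np.load(fp)      # shape (N, 345), float32
        labels = np.load(fp)    # shape (N, 5), float32
    array_to_tfrecords(data, labels, 'datafile%d.tfrecord' % i)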

Read TFRecordDataset

def parse_proto(example_proto):
    features = {
        'X': tf.FixedLenFeature((345,), tf.float32),
        'y': tf.FixedLenFeature((5,), tf.float32),
    }
    parsed_features = tf.parse_single_example(example_proto, features)
    return parsed_features['X'], parsed_features['y']

def read_tfrecords(file_names=("file1.tfrecord", "file2.tfrecord", "file3.tfrecord"),
                   buffer_size=10000,
                   batch_size=100):
    dataset = tf.contrib.data.TFRecordDataset(file_names)
    dataset = dataset.map(parse_proto)
    dataset = dataset.shuffle(buffer_size)
    dataset = dataset.repeat()
    dataset = dataset.batch(batch_size)
    # return the iterator together with an op that (re)initializes it from this
    # dataset; the caller must run the init op once before pulling batches
    iterator = tf.contrib.data.Iterator.from_structure(dataset.output_types, dataset.output_shapes)
    return iterator, iterator.make_initializer(dataset)

A guide can be found here.
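
To wire this into your model, a rough sketch could look like the following; it builds the graph directly on the iterator's tensors instead of the feed_dict placeholders, reuses your weight_and_bias helper from the question, and treats num_steps, the file names and the checkpoint path as placeholders:

iterator, init_op = read_tfrecords(file_names=("datafile1.tfrecord", "datafile2.tfrecord"))
X_batch, y_batch = iterator.get_next()   # tensors of shape (batch, 345) and (batch, 5)

# same model as in the question, but fed by the input pipeline instead of placeholders
wi, bi = weight_and_bias(345, 2048)
hidden_fc = tf.nn.sigmoid(tf.matmul(X_batch, wi) + bi)
wo, bo = weight_and_bias(2048, 5)
out_fc = tf.nn.sigmoid(tf.matmul(hidden_fc, wo) + bo)
loss = tf.reduce_mean(tf.squared_difference(y_batch, out_fc))
train_op = tf.train.AdamOptimizer().minimize(loss)

saver = tf.train.Saver()
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    sess.run(init_op)                     # initialize the iterator before training
    for step in range(num_steps):
        _, loss_val = sess.run([train_op, loss])
    saver.save(sess, './model.ckpt')      # checkpointed weights can be restored later

Feeding the model from iterator.get_next() avoids pushing every batch through feed_dict, which is the main overhead of the current approach, and tf.train.Saver covers saving the weights for later predictions.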


Source: https://habr.com/ru/post/1272734/

