For my project I have a large amount of data, about 60 GB spread across .npy files, each around 1 GB and each containing roughly 750,000 records and labels.
Each record is 345 float32 values, and each label is 5 float32 values.
I have read the TensorFlow Dataset documentation and the queues/threads documentation, but I cannot figure out how best to handle the input for training, and then how to save the model and its weights for future predictions.
My model is pretty straightforward and looks like this:
x = tf.placeholder(tf.float32, [None, 345], name='x')
y = tf.placeholder(tf.float32, [None, 5], name='y')

wi, bi = weight_and_bias(345, 2048)
hidden_fc = tf.nn.sigmoid(tf.matmul(x, wi) + bi)

wo, bo = weight_and_bias(2048, 5)
out_fc = tf.nn.sigmoid(tf.matmul(hidden_fc, wo) + bo)

loss = tf.reduce_mean(tf.squared_difference(y, out_fc))
train_op = tf.train.AdamOptimizer().minimize(loss)
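As for saving the model and weights mentioned above, I assume it would be something along the lines of tf.train.Saver, roughly like this (the checkpoint path and the variable names in the comments are just placeholders):

saver = tf.train.Saver()  # by default covers all variables of the graph above

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    # ... training loop goes here ...
    saver.save(sess, './model.ckpt')  # placeholder path

# and later, for prediction, something like:
#   saver.restore(sess, './model.ckpt')
#   predictions = sess.run(out_fc, feed_dict={x: new_records})

But I am not sure whether that part changes once the input side is reworked.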
While training my neural network, I read the files one at a time in random order, then used a shuffled numpy array of indices into each file to build each batch by hand and feed it to train_op via feed_dict. From everything I have read this is very inefficient, and I should replace it with datasets or queues and threads, but as I said, the documentation did not help.
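To be concrete, each epoch currently looks roughly like this (the batch size and file pattern are just placeholders, and sess is assumed to be an open tf.Session with the graph above):

import glob
import numpy as np

BATCH_SIZE = 128  # placeholder value

files = glob.glob('datafile*.npy')  # the ~60 one-GB files
np.random.shuffle(files)            # visit the files in random order

for path in files:
    with open(path, 'rb') as fp:
        data = np.load(fp)      # shape (~750000, 345), float32
        labels = np.load(fp)    # shape (~750000, 5), float32

    # shuffle the records of this file and feed them batch by batch
    order = np.random.permutation(len(data))
    for start in range(0, len(order), BATCH_SIZE):
        idx = order[start:start + BATCH_SIZE]
        sess.run(train_op, feed_dict={x: data[idx], y: labels[idx]})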
So what is the best way to handle large amounts of data like this in TensorFlow?
Also, for reference, my data was saved to each numpy file in two steps:
with open('datafile1.npy', 'wb') as fp:
    np.save(fp, data)
    np.save(fp, labels)