I am creating a TensorFlow model where the input is a large scipy sparse matrix; each row is a sample of dimension > 50k, of which only a few hundred values are non-zero.
Currently, I store this matrix as a pickle, load it completely into memory, do batch processing, and convert the samples in each batch to a dense array, which I feed into the model. This works fine as long as all the data fits into memory, but it won't work once I want to use more data.
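Roughly, my current pipeline looks like this (a minimal sketch of what I described above; the file name, batch size, and variable names are just illustrative):

import pickle

with open("data.pkl", "rb") as f:
    X = pickle.load(f)  # scipy.sparse CSR matrix, shape (n_samples, ~50k)

batch_size = 128
for start in range(0, X.shape[0], batch_size):
    dense_batch = X[start:start + batch_size].toarray()  # densify one batch at a time
    # feed dense_batch into the model here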
I explored TFRecords as a way to serialize my data and read it more efficiently with TensorFlow, but I couldn't find any examples with sparse data.
I found an example for MNIST:

writer = tf.python_io.TFRecordWriter("mnist.tfrecords")

where label is an int and features is an np.array of length 784, representing each pixel of the image as a float. I understand this approach, but I can't reproduce it, since converting each row of my sparse matrix into a dense np.array would be impractical.
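For reference, a writer along those lines would look roughly like this (my own sketch, not the exact example I found; it assumes images is an (N, 784) float NumPy array and labels an (N,) int array):

import numpy as np
import tensorflow as tf

writer = tf.python_io.TFRecordWriter("mnist.tfrecords")
for pixels, label in zip(images, labels):
    # one tf.train.Example per image, with the label and all 784 pixel values
    example = tf.train.Example(features=tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(label)])),
        "features": tf.train.Feature(float_list=tf.train.FloatList(value=pixels.astype(np.float32).tolist())),
    }))
    writer.write(example.SerializeToString())
writer.close()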
It seems to me that I should create a key for each feature (column) and store only the non-zero values for each example, but I'm not sure whether you can specify a default value (0 in my case) for the "missing" entries.
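What I have in mind is something like the sketch below: write only the non-zero column indices and values per row, then densify at read time with 0 as the default. This is just my guess at how it might work (TF 1.x queue-based input, a hypothetical file name sparse.tfrecords, X being my CSR matrix and y the labels), not something I have verified:

import tensorflow as tf

# Writing: one Example per row, keeping only the non-zero entries.
writer = tf.python_io.TFRecordWriter("sparse.tfrecords")
for i in range(X.shape[0]):
    row = X.getrow(i).tocoo()  # column indices of a CSR row come out sorted
    example = tf.train.Example(features=tf.train.Features(feature={
        "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[int(y[i])])),
        "indices": tf.train.Feature(int64_list=tf.train.Int64List(value=row.col.tolist())),
        "values": tf.train.Feature(float_list=tf.train.FloatList(value=row.data.tolist())),
    }))
    writer.write(example.SerializeToString())
writer.close()

# Reading: parse the variable-length features and fill the missing columns with 0.
filename_queue = tf.train.string_input_producer(["sparse.tfrecords"])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
parsed = tf.parse_single_example(serialized_example, features={
    "label": tf.FixedLenFeature([], tf.int64),
    "indices": tf.VarLenFeature(tf.int64),
    "values": tf.VarLenFeature(tf.float32),
})
dense_row = tf.sparse_to_dense(
    sparse_indices=parsed["indices"].values,
    output_shape=[50000],  # my feature dimension
    sparse_values=parsed["values"].values,
    default_value=0.0)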
What would be the best way to do this?