Encoding a scipy sparse matrix as TFRecords

I am creating a TensorFlow model where the input is a large scipy sparse matrix. Each row is a sample of dimension > 50k, of which only a few hundred values are nonzero.

Currently, I store this matrix as a pickle, load it entirely into memory, iterate over it in batches, and convert the samples in each batch to a dense array, which I feed into the model. This works fine as long as all the data fits in memory, but it is not feasible once I want to use more data.
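
For context, my current pipeline looks roughly like this (a minimal sketch; the file name, labels handling, and batch_size are placeholders):

import pickle

# load the whole sparse matrix into memory at once
with open("data.pkl", "rb") as f:
    X = pickle.load(f)  # scipy.sparse CSR matrix, shape (n_samples, >50k)

batch_size = 128
for start in range(0, X.shape[0], batch_size):
    # densify only the current batch before feeding it to the model
    batch = X[start:start + batch_size].toarray()
    # ... feed `batch` into the model ...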

I explored TFRecords as a way to serialize my data and read it more efficiently with TensorFlow, but I couldn't find any examples with sparse data.

I found an example for MNIST:

import tensorflow as tf

writer = tf.python_io.TFRecordWriter("mnist.tfrecords")
# construct the Example proto object
example = tf.train.Example(
    # Example contains a Features proto object
    features=tf.train.Features(
        # Features contains a map of string to Feature proto objects
        feature={
            # A Feature contains one of either a int64_list,
            # float_list, or bytes_list
            'label': tf.train.Feature(
                int64_list=tf.train.Int64List(value=[label])),
            'image': tf.train.Feature(
                int64_list=tf.train.Int64List(value=features.astype("int64"))),
        }))
# use the proto object to serialize the example to a string
serialized = example.SerializeToString()
# write the serialized object to disk
writer.write(serialized)
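
For completeness, reading such records back with the TF 1.x queue-based input pipeline would look roughly like this (my own sketch, not part of the example I found):

import tensorflow as tf

# queue that cycles through the TFRecord files
filename_queue = tf.train.string_input_producer(["mnist.tfrecords"])
reader = tf.TFRecordReader()
_, serialized_example = reader.read(filename_queue)
# parse one serialized Example back into tensors
parsed = tf.parse_single_example(serialized_example, features={
    'label': tf.FixedLenFeature([], tf.int64),
    'image': tf.FixedLenFeature([784], tf.int64),
})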

where label is an int, and features is an np.array of length 784, representing each pixel in the image as a float. I understand this approach, but I cannot reproduce it, since converting each row of my sparse matrix into a dense np.array would also be impractical.

It seems to me that I need to create a key for each feature (column) and store only the nonzero values for each example, but I'm not sure whether you can specify a default value (0 in my case) for the "missing" entries.
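
In other words, something along these lines, where I would store the nonzero column indices and their values per example (an untested sketch; the feature names 'indices' and 'values' and the variables X and labels are my own placeholders):

import tensorflow as tf

writer = tf.python_io.TFRecordWriter("sparse.tfrecords")
for i in range(X.shape[0]):
    row = X.getrow(i).tocoo()  # nonzero entries of one sample
    example = tf.train.Example(features=tf.train.Features(feature={
        'label': tf.train.Feature(
            int64_list=tf.train.Int64List(value=[labels[i]])),
        # column indices of the nonzero entries
        'indices': tf.train.Feature(
            int64_list=tf.train.Int64List(value=row.col.astype("int64"))),
        # the nonzero values themselves
        'values': tf.train.Feature(
            float_list=tf.train.FloatList(value=row.data.astype("float"))),
    }))
    writer.write(example.SerializeToString())
writer.close()

On the reading side, tf.VarLenFeature can parse the variable-length lists, and tf.sparse_to_dense accepts a default_value, which seems to be the "missing = 0" behavior I'm after:

parsed = tf.parse_single_example(serialized_example, features={
    'label': tf.FixedLenFeature([], tf.int64),
    'indices': tf.VarLenFeature(tf.int64),
    'values': tf.VarLenFeature(tf.float32),
})
# VarLenFeature yields SparseTensors; rebuild one dense row of width 50000,
# filling the missing positions with 0 (indices must be in increasing order)
dense_row = tf.sparse_to_dense(parsed['indices'].values, [50000],
                               parsed['values'].values, default_value=0.0)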

What would be the best way to do this?
