Using Sparse Matrices with Keras and Tensorflow

My data can be seen as a matrix of 10B entries (100M x 100), which is very sparse (fewer than 1/100 * 1/100 of the entries are non-zero). I would like to feed the data into a Keras neural network model I have built, using the Tensorflow backend.

My first thought was to expand the data to be dense, that is, to write out all 10B entries into a series of CSVs, with most entries being zero. However, this quickly overwhelms my resources (even doing the ETL overwhelms pandas and makes postgres struggle). So I need to use true sparse matrices.
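
For scale: at under 1/10,000 density, the 10B-cell matrix holds fewer than ~1M non-zeros, while a dense float32 version would need ~40 GB. Here is a minimal sketch of building a scipy CSR matrix from (row, col, value) triplets, e.g. as they might be pulled out of postgres; the triplets below are made up for illustration:

import numpy as np
import scipy.sparse as sp

# Hypothetical (row, col, value) triplets for the non-zero entries only,
# e.g. streamed out of postgres instead of materializing all 10B cells
rows = np.array([0, 3, 7])
cols = np.array([2, 1, 99])
vals = np.array([1.5, -0.2, 3.0])

# shape would be (100_000_000, 100) for the real data; tiny here
X = sp.coo_matrix((vals, (rows, cols)), shape=(10, 100)).tocsr()
print(X.nnz, X.shape)   # only the non-zeros are stored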

How can I do this with Keras (and Tensorflow)? While Numpy does not support sparse matrices, Scipy and Tensorflow both do. There are many discussions (e.g. https://github.com/fchollet/keras/pull/1886 https://github.com/fchollet/keras/pull/3695/files https://github.com/pplonski/keras-sparse-check https://groups.google.com/forum/#!topic/keras-users/odsQBcNCdZg) about this idea, either using scipy sparse matrices or going directly to Tensorflow sparse tensors. But I can't find a clear conclusion, and I haven't been able to get anything to work (or even know clearly which way to go!).

How can I do this?

I believe there are two possible approaches:

  1. Keep it as a scipy sparse matrix, then, when giving Keras a minibatch, make it dense
  2. Keep it sparse all the way through, and use Tensorflow sparse tensors

I also think #2 is preferable, because you would get much better performance throughout (I believe), but #1 is probably simpler and would be adequate. I'd be happy with either.
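
For orientation on approach #2, the conversion from a scipy matrix to a Tensorflow sparse tensor is mechanical. A minimal sketch (the variable names are mine, and this shows only the primitive, not a full training setup):

import numpy as np
import scipy.sparse
import tensorflow as tf

# A tiny stand-in for the real matrix, in COO format
X = scipy.sparse.random(1000, 100, density=0.0001).tocoo()

# tf.SparseTensor takes an (nnz, 2) array of indices, the values, and the shape
indices = np.column_stack([X.row, X.col]).astype(np.int64)
sparse_x = tf.SparseTensor(indices=indices, values=X.data, dense_shape=X.shape)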

How can this be implemented?

+18
2 answers

Sorry, I don't have the reputation to comment, but I think you should take a look at the answer here: Keras, sparse matrix issue. I tried it and it works correctly; just one note: at least in my case, shuffling led to really bad results, so I used this slightly modified, unshuffled alternative:

import numpy as np

def nn_batch_generator(X_data, y_data, batch_size):
    # X_data is a scipy sparse matrix; each minibatch is densified on the fly
    samples_per_epoch = X_data.shape[0]
    number_of_batches = int(np.ceil(samples_per_epoch / batch_size))
    counter = 0
    index = np.arange(np.shape(y_data)[0])
    while True:
        index_batch = index[batch_size * counter:batch_size * (counter + 1)]
        X_batch = X_data[index_batch, :].todense()
        y_batch = y_data[index_batch]
        counter += 1
        yield np.array(X_batch), y_batch
        if counter >= number_of_batches:
            # reset so the generator loops over the data indefinitely
            counter = 0

Note that Keras's own shuffling (shuffle=True in fit) does not apply when batches come from a generator, so any shuffling has to happen inside the generator itself.
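
For completeness, wiring the generator into training would look roughly like this with Keras 2's fit_generator (X_sparse, y, and the hyperparameters here are placeholders, not from the original answer):

batch_size = 32
model.fit_generator(
    generator=nn_batch_generator(X_sparse, y, batch_size),
    steps_per_epoch=int(np.ceil(X_sparse.shape[0] / batch_size)),
    epochs=10)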

+6

This answer addresses the second approach: keeping the data sparse all the way through. With the Tensorflow backend, Keras can accept a scipy sparse matrix as input if the Input layer is declared with sparse=True. A minimal example:

from keras.layers import Dense, Input
from keras.models import Model
import scipy.sparse
import numpy as np

# A random sparse input matrix (COO format) and a dense target matrix
trainX = scipy.sparse.random(1024, 1024)
trainY = np.random.rand(1024, 1024)

# sparse=True lets the model consume scipy sparse matrices directly
inputs = Input(shape=(trainX.shape[1],), sparse=True)
outputs = Dense(trainY.shape[1], activation='softmax')(inputs)
model = Model(inputs=inputs, outputs=outputs)
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

steps = 10
for i in range(steps):
    # For simplicity, we directly use trainX and trainY in this example
    # Usually, this is where batches are prepared
    print(model.train_on_batch(trainX, trainY))
# [3549.2546, 0.0]
# ...
# [3545.6448, 0.0009765625]
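
To fill in the batching that the comments allude to, one option (my sketch, not part of the original answer) is to slice the CSR form into row blocks; each slice remains a scipy sparse matrix, which the sparse=True input accepts directly:

trainX_csr = trainX.tocsr()   # row slicing is cheap in CSR format
batch_size = 128
for epoch in range(steps):
    for start in range(0, trainX_csr.shape[0], batch_size):
        batch = slice(start, start + batch_size)
        # each slice stays sparse all the way into the model
        model.train_on_batch(trainX_csr[batch], trainY[batch])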


0

Source: https://habr.com/ru/post/1666186/

