Build sparse matrix on disk on the fly in Python

Question

I am currently doing heavy text processing, for which I need to build a sparse matrix of float32s with dimensions of roughly (2M, 5M). I am building this matrix column by column while reading a corpus of 5M documents. For this I use the sparse dok_matrix data structure from SciPy. However, when I reach the 500,000th document, my memory is full (about 30 GB is used) and the program crashes. What I ultimately want to do is run a dimensionality reduction algorithm on the matrix using sklearn, but, as said, it is impossible to hold and build the entire matrix in memory. I looked into numpy.memmap, since sklearn supports it, and tried to memmap some of the underlying numpy data structures of the SciPy sparse matrix, but I could not get that to work.

It is not possible to save the entire matrix in a dense format, as this would require 40 TB of disk space. Therefore, I think that HDF5 and PyTables are not an option for me (?).
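As a quick sanity check of the 40 TB figure, here is the arithmetic, assuming 4 bytes per float32 and decimal terabytes (10^12 bytes). The variable names are only illustrative.

```python
# Dense storage estimate for a (2M, 5M) float32 matrix.
n_rows, n_cols = 2_000_000, 5_000_000
bytes_per_value = 4  # one float32

dense_bytes = n_rows * n_cols * bytes_per_value
dense_terabytes = dense_bytes / 1e12  # decimal terabytes (assumption)

print(dense_terabytes)
```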

Now my question is: how can I build a sparse matrix "on the fly", writing directly to disk instead of memory, so that I can use it later in sklearn?

Thanks!

python numpy scipy memory matrix

Cedricdeboom Jun 24 '15 at 16:08

1 answer

Lukas Koestler · Answer 1 · 2019-02-01T20:22:27+0000

This is not a full answer, but I lack the reputation to comment, sorry. In response to a comment by Cedricdeboom: numpy.memmap does not require the size to be known in advance, see also my answer here . Turning to the question of using memory-mapped buffers for a sparse matrix, I wrote the following test:

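    import numpy as np
    import scipy.sparse as sp

    # Number of stored entries per buffer (also used as the matrix dimension).
    s = 2**30

    # The three CSC arrays live in memory-mapped files instead of RAM.
    data = np.memmap("sp_data.bin", dtype=np.float32, mode="w+", shape=(s,))
    indices = np.memmap("sp_indices.bin", dtype=np.int32, mode="w+", shape=(s,))
    indptr = np.memmap("sp_indptr.bin", dtype=np.int32, mode="w+", shape=(s+1,))

    # Wrap the memory-mapped buffers in a SciPy CSC matrix.
    A = sp.csc_matrix((data, indices, indptr), shape=(s,s))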

The memmap files have a total size of more than 12 GB, yet less than 1 GB of RAM was used.

Therefore, I think it should indeed be possible to build data, indices and indptr incrementally, as shown in my previous answer, and then construct a scipy.sparse.csc_matrix from them.
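To make the incremental idea concrete, here is a minimal sketch of one way it could look. It is not the code from the previous answer mentioned above. The function name build_csc_on_disk and the parameter nnz_capacity are hypothetical, and the sketch assumes that an upper bound on the number of non-zeros and the number of columns are known up front. It fills the three CSC arrays column by column in memory-mapped files and only then wraps them in a csc_matrix.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sp


def build_csc_on_disk(columns, n_rows, n_cols, nnz_capacity, directory):
    """Fill memory-mapped CSC arrays column by column (hypothetical helper).

    columns: iterable of (row_indices, values) pairs, one pair per column.
    nnz_capacity: assumed upper bound on the total number of non-zeros.
    """
    # int32 indices as in the test above; beyond 2**31 - 1 stored entries
    # int64 would be needed for indices and indptr (assumption for large data).
    data = np.memmap(os.path.join(directory, "sp_data.bin"),
                     dtype=np.float32, mode="w+", shape=(nnz_capacity,))
    indices = np.memmap(os.path.join(directory, "sp_indices.bin"),
                        dtype=np.int32, mode="w+", shape=(nnz_capacity,))
    indptr = np.memmap(os.path.join(directory, "sp_indptr.bin"),
                       dtype=np.int32, mode="w+", shape=(n_cols + 1,))

    nnz = 0        # number of entries written so far
    n_filled = 0   # number of columns written so far
    for rows, values in columns:
        rows = np.asarray(rows, dtype=np.int32)
        values = np.asarray(values, dtype=np.float32)
        order = np.argsort(rows)  # keep row indices sorted within a column
        k = len(rows)
        if nnz + k > nnz_capacity:
            raise ValueError("nnz_capacity is too small")
        indices[nnz:nnz + k] = rows[order]
        data[nnz:nnz + k] = values[order]
        nnz += k
        n_filled += 1
        indptr[n_filled] = nnz  # end offset of the column just written

    # Columns that were never supplied are treated as empty.
    indptr[n_filled + 1:] = nnz

    for buf in (data, indices, indptr):
        buf.flush()

    # Only the first nnz entries of data and indices are meaningful.
    return sp.csc_matrix((data[:nnz], indices[:nnz], indptr),
                         shape=(n_rows, n_cols))


# Tiny usage example with three columns, one of them empty.
tmp_dir = tempfile.mkdtemp()
cols = [([0, 2], [1.0, 2.0]), ([], []), ([1], [3.0])]
A = build_csc_on_disk(cols, n_rows=3, n_cols=3, nnz_capacity=8,
                      directory=tmp_dir)
print(A.toarray())
```

The files keep their full nnz_capacity size on disk. Whether a later sklearn step can work on the memory-mapped buffers without copying them into RAM depends on the estimator and is not something this sketch checks.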

Hope this helps. Best regards, Lukas
