Memory-efficient multiplication of large matrices stored on disk

To store a large matrix on disk, I use numpy.memmap.

Here is some sample code for testing large matrix multiplication:

    import numpy as np
    import time

    rows = 10000  # can be much larger, e.g. 1,000,000
    cols = 1000

    # create some data in memory
    data = np.arange(rows * cols, dtype='float32')
    data.resize((rows, cols))

    # create files on disk
    fp0 = np.memmap('C:/data_0', dtype='float32', mode='w+', shape=(rows, cols))
    fp1 = np.memmap('C:/data_1', dtype='float32', mode='w+', shape=(rows, cols))
    fp0[:] = data[:]
    fp1[:] = data[:]

    # matrix transpose test
    tr = np.memmap('C:/data_tr', dtype='float32', mode='w+', shape=(cols, rows))
    tr = np.transpose(fp1)  # memory consumption?
    print fp1.shape
    print tr.shape

    res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))
    t0 = time.time()
    # redefinition? takes 342 seconds on my machine; multiplying the matrices
    # in RAM takes 345 seconds, which I find strange
    res = np.dot(fp0, tr)
    res[:] = np.dot(fp0, tr)  # assignment?
    print res.shape
    print (time.time() - t0)

So my questions are:

  • How can I limit the memory consumption of this procedure to some fixed value, say 100 MB (or 1 GB, or whatever)? I also don’t understand how to estimate the memory consumption of the procedure: I think memory is only allocated when the `data` array is created, but how much memory is used when working with memmap files?
  • Is there a better way to multiply large matrices stored on disk? For example, maybe the data is not laid out on disk or read from disk optimally, or is not processed in chunks properly, and the dot product may only use one core. Should I use something like PyTables?

I am also interested in algorithms that solve linear systems of equations (SVD and so on) with limited memory use. These algorithms are perhaps called “out-of-core” or “iterative”, and I think there is an analogy here: disk ↔ RAM, GPU RAM ↔ CPU RAM, CPU RAM ↔ CPU cache.

Also, here I found information about matrix multiplication in PyTables.

I also found this in R, but I need it for Python or Matlab.

python numpy matrix bigdata pytables
Oct 14 '13 at 11:18
2 answers

Dask.array provides a NumPy-like interface to large on-disk arrays using blocked algorithms and task scheduling. It can easily perform matrix multiplication and other simple numpy operations.
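
For the arrays in the question, a minimal sketch might look like the following (the chunk size and file paths are just placeholders, and dask's API may differ slightly between versions):

    import numpy as np
    import dask.array as da

    # wrap the on-disk memmaps in dask arrays made of ~4 MB blocks
    fp0 = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(10000, 1000))
    fp1 = np.memmap('C:/data_1', dtype='float32', mode='r', shape=(10000, 1000))
    a = da.from_array(fp0, chunks=(1000, 1000))
    b = da.from_array(fp1, chunks=(1000, 1000))

    res = a.dot(b.T)  # lazy: only builds a task graph, no big allocation yet
    out = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(10000, 10000))
    da.store(res, out)  # executes block by block, streaming results to disk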

Blocked linear algebra is harder, and you may want to check out some of the academic papers on this topic. Dask supports QR and SVD factorizations of tall-and-skinny matrices.
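
A sketch of those factorizations, assuming a reasonably recent dask; the random array here stands in for real data and is chunked only along its rows, which is what the tall-and-skinny routines expect:

    import dask.array as da

    x = da.random.random((100000, 1000), chunks=(10000, 1000))  # tall-and-skinny
    q, r = da.linalg.qr(x)
    u, s, v = da.linalg.svd(x)
    singular_values = s.compute()  # nothing runs until .compute()/.store()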

Regardless, for large arrays you really want blocked algorithms, not naive traversals, which will hit the disk in unpleasant ways.
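
Even without dask, the core idea of a blocked algorithm can be hand-rolled over the memmaps from the question: compute one block of output rows at a time and write it straight into the memmapped result, so peak RAM is dominated by a single block rather than the whole product (the block size below is arbitrary):

    import numpy as np

    rows, cols, block = 10000, 1000, 1000
    fp0 = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(rows, cols))
    fp1 = np.memmap('C:/data_1', dtype='float32', mode='r', shape=(rows, cols))
    res = np.memmap('C:/data_res', dtype='float32', mode='w+', shape=(rows, rows))

    for i in range(0, rows, block):
        # one block of rows of fp0 times fp1.T; RAM holds this block, the
        # matching block of the result, and whatever pages of fp1 the OS caches
        res[i:i + block] = np.dot(fp0[i:i + block], fp1.T)
    res.flush()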

May 01 '15 at 19:21

Consider using NumExpr for your processing: https://github.com/pydata/numexpr

... internally, NumExpr uses its own vectorized virtual machine built around a chunked-read strategy, so that it works efficiently on optimally sized blocks of data in memory. It can easily beat naive NumPy operations if used properly.
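
Note that NumExpr accelerates element-wise expressions and reductions rather than matrix products, so it complements np.dot instead of replacing it. A minimal sketch against memmapped inputs like those in the question (the paths and the expression itself are just placeholders):

    import numpy as np
    import numexpr as ne

    a = np.memmap('C:/data_0', dtype='float32', mode='r', shape=(10000, 1000))
    b = np.memmap('C:/data_1', dtype='float32', mode='r', shape=(10000, 1000))

    # evaluated chunk by chunk across cores, without building large temporaries
    c = ne.evaluate('a * b + 0.5 * sqrt(a)')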

NumExpr may cover #2 of your problem. If you address #1 by using a streamable binary format, you can use a chunked-read approach when loading your data files - for example:

    with open('path/to/your-data.bin', 'rb') as binary:
        while True:
            chunk = binary.read(4096)  # or whatever block size suits you
            if not chunk:
                break
            # ... process the chunk here ...

If that is too low-level for you, I would recommend looking at the HDF5 library via h5py: http://www.h5py.org - it is the best binary serialization solution for NumPy-based structures that I know of. The h5py module supports compression, chunked reads, dtypes, metadata... you name it.
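
A minimal sketch of chunked, compressed storage with h5py (the file name, dataset name, chunk shape and compression settings are all arbitrary):

    import numpy as np
    import h5py

    rows, cols = 10000, 1000
    data = np.arange(rows * cols, dtype='float32').reshape(rows, cols)

    with h5py.File('data.h5', 'w') as f:
        dset = f.create_dataset('A', shape=(rows, cols), dtype='float32',
                                chunks=(1000, cols), compression='gzip')
        for i in range(0, rows, 1000):  # write one row-block at a time
            dset[i:i + 1000] = data[i:i + 1000]

    with h5py.File('data.h5', 'r') as f:
        block = f['A'][:1000]  # read back just one row-block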

Good luck

May 08 '14 at 11:50


