I need to read parts of a huge numpy array stored in a memory-mapped file, process the data, and repeat for another part of the array. The entire array takes up about 50 GB on disk, and my computer has 8 GB of RAM.
First, I create the memory-mapped file with numpy.memmap by reading a large number of small files, processing their data, and writing the processed data into the memmap array. While creating the memmap file I have no memory problems (I call memmap.flush() periodically). This is how I create the memory-mapped file:
mmapData = np.memmap(mmapFile, mode='w+', shape=(large_no1, large_no2))
for i1 in np.arange(numFiles):
    auxData = load_data_from(file[i1])
    mmapData[i1, :] = auxData
    mmapData.flush()  # Do this every 10 iterations or so
However, when I try to access small parts (< 10 MB) of the memmap file, it floods my entire RAM while the memmap object is being created. The machine slows to a crawl and I can't do anything. This is how I try to read data from the memory-mapped file:
mmapData = np.memmap(mmapFile, mode='r', shape=(large_no1, large_no2))
aux1 = mmapData[5, 1:int(1e7)]
I thought that using mmap or numpy.memmap would let me access parts of the array without loading the whole thing into memory. What am I missing?
Am I using the wrong tool to access parts of a large numpy array (> 20 GB) stored on disk?
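For reference, here is a minimal, self-contained sketch of the behaviour I expected, using small made-up sizes so it runs anywhere (the file name, dtype, and shape below are placeholders, not my real 50 GB data set): opening the memmap read-only and slicing it should only copy the requested slice into RAM.

import numpy as np

# Placeholder file name and sizes; the real array is ~50 GB, this is ~100 MB.
test_file = 'memmap_test.dat'
shape = (100, 1_000_000)            # 100 * 1e6 uint8 values, about 100 MB

# Write phase: create the memmap, fill one row, flush, and release it.
m = np.memmap(test_file, dtype=np.uint8, mode='w+', shape=shape)
m[5, :] = 42
m.flush()
del m

# Read phase: re-open read-only and take a small slice. I expect only this
# slice (a few kB) to end up in memory, not the whole 100 MB file.
m = np.memmap(test_file, dtype=np.uint8, mode='r', shape=shape)
aux1 = np.array(m[5, 1:10_000])     # explicit copy of the requested slice
print(aux1.nbytes)                  # 9999 bytes, far below the file size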