The correct way to perform operations with Memmapped arrays

Question


The operation I am confused about looks like this. I have been doing it on regular NumPy arrays, but with a memmap I want to understand how it all works.

 arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100  # Percentile rank of each value with respect to its entire column

This is what I used for a regular numpy array.

Now, given that arr1 is a 20 GB memmapped array, I have a few questions:

1.

 arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100

arr2 would be a regular NumPy array, I assume? So executing this would be disastrous memory-wise, right?

Now suppose I have created arr2 as a memmapped array of the correct size (filled with zeros).

2.

 arr2 = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100

versus

 arr2[:] = np.argsort(np.argsort(arr1,axis=0),axis=0) / float(len(arr1)) * 100

What is the difference?

3.

Would it be more memory efficient to compute np.argsort separately into a temporary memmapped array, then np.argsort(np.argsort) into another temporary memmapped array, and only then perform the final operation? The argsort result for a 20 GB array will itself be very large!

I think these questions will help me understand the inner workings of memmapped arrays in Python!

Thanks...


python numpy

user1265125 Aug 29 '14 at 11:41


1 answer

AP · Accepted Answer · 2014-09-03T22:44:53+0000

First I will try to answer part 2, and then 1 and 3.

First, arr = <something> is a plain variable assignment: it rebinds the name arr to a different object. arr[:] = <something>, by contrast, copies values into the existing array. In the code below, after arr[:] = x, arr is still a memmapped array, whereas after arr = x, arr is a plain ndarray.

 >>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
 >>> type(arr)
 <class 'numpy.core.memmap.memmap'>
 >>> x = np.ones((1,10000000))
 >>> type(x)
 <class 'numpy.ndarray'>
 >>> arr[:] = x
 >>> type(arr)
 <class 'numpy.core.memmap.memmap'>
 >>> arr = x
 >>> type(arr)
 <class 'numpy.ndarray'>
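The same distinction can be checked by object identity rather than by type alone. The following is only a minimal sketch: the scratch file name demo.dat and the variable names are made up for illustration, and a much smaller shape is used than in the session above.

```python
import os
import tempfile

import numpy as np

# Hypothetical scratch file; any writable path would do.
path = os.path.join(tempfile.mkdtemp(), "demo.dat")

arr = np.memmap(path, dtype="float32", mode="w+", shape=(1, 1000))
x = np.ones((1, 1000))

original = arr  # second reference to the memmap object

arr[:] = x  # copies the values of x into the mapped buffer
slice_keeps_object = arr is original       # same memmap object as before
values_copied = bool((arr == 1.0).all())   # the mapped data now holds ones

arr = x  # rebinds the name only; nothing is written to the file
rebound_to_x = arr is x                    # arr now refers to the in-memory array
still_memmap = isinstance(original, np.memmap)  # the memmap itself is untouched
```

Slice assignment never changes which object the name refers to, which is why the mapping survives it.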

np.argsort returns an array of the same type as its argument. So in this particular case one might expect no difference between arr = np.argsort(x) and arr[:] = np.argsort(x): in your code, arr2 would be of the memmap type either way. But there is a difference.

 >>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
 >>> x = np.ones((1,10000000))
 >>> arr[:] = x
 >>> type(np.argsort(x))
 <class 'numpy.ndarray'>
 >>> type(np.argsort(arr))
 <class 'numpy.core.memmap.memmap'>

OK, now for the difference. With arr[:] = np.argsort(arr), if we watch the memmapped file, we see that every change to arr is followed by a change in the file's md5sum.

 >>> import os
 >>> import numpy as np
 >>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
 >>> arr[:] = np.zeros((1,10000000))
 >>> os.system("md5sum mm")
 48e9a108a3ec623652e7988af2f88867 mm
 0
 >>> arr += 1.1
 >>> os.system("md5sum mm")
 b8efebf72a02f9c0b93c0bbcafaf8cb1 mm
 0
 >>> arr[:] = np.argsort(arr)
 >>> os.system("md5sum mm")
 c3607e7de30240f3e0385b59491ac2ce mm
 0
 >>> arr += 1.3
 >>> os.system("md5sum mm")
 1e6af2af114c70790224abe0e0e5f3f0 mm
 0

We see that arr retains its _mmap attribute.

 >>> arr._mmap
 <mmap.mmap object at 0x7f8e0f086198>

Now, using arr = np.argsort(arr), we see that the md5sum stops changing. Even though the type of arr is still memmap, it is a new object, and it seems that the memory mapping has been dropped.

 >>> import os
 >>> import numpy as np
 >>> arr = np.memmap('mm', dtype='float32', mode='w+', shape=(1,10000000))
 >>> arr[:] = np.zeros((1,10000000))
 >>> os.system("md5sum mm")
 48e9a108a3ec623652e7988af2f88867 mm
 0
 >>> arr += 1.1
 >>> os.system("md5sum mm")
 b8efebf72a02f9c0b93c0bbcafaf8cb1 mm
 0
 >>> arr = np.argsort(arr)
 >>> os.system("md5sum mm")
 b8efebf72a02f9c0b93c0bbcafaf8cb1 mm
 0
 >>> arr += 1.3
 >>> os.system("md5sum mm")
 b8efebf72a02f9c0b93c0bbcafaf8cb1 mm
 0
 >>> type(arr)
 <class 'numpy.core.memmap.memmap'>

Now the _mmap attribute is None.

 >>> arr._mmap
 >>> type(arr._mmap)
 <class 'NoneType'>
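The two experiments above can also be repeated without shelling out to md5sum. This is a rough sketch, not the session that produced the output above: the helper file_digest and the file name mm.dat are hypothetical, SHA-256 is swapped in for MD5, and the array is kept small. The final step adds an integer rather than 1.3, because the argsort result has an integer dtype.

```python
import hashlib
import os
import tempfile

import numpy as np


def file_digest(path):
    """Return the SHA-256 hex digest of the file's current contents."""
    with open(path, "rb") as fh:
        return hashlib.sha256(fh.read()).hexdigest()


path = os.path.join(tempfile.mkdtemp(), "mm.dat")  # hypothetical scratch file
arr = np.memmap(path, dtype="float32", mode="w+", shape=(1, 1000))

arr[:] = 0
arr.flush()  # push pending changes to disk before hashing
digest_zeros = file_digest(path)

arr += 1.1  # in-place update of the mapped buffer
arr.flush()
digest_after_inplace = file_digest(path)

mapped = arr           # keep a handle on the real memmap
arr = np.argsort(arr)  # rebinds arr to a freshly allocated result
arr += 1               # modifies the new array only
mapped.flush()
digest_after_rebind = file_digest(path)

file_changed_by_inplace = digest_zeros != digest_after_inplace
file_untouched_by_rebind = digest_after_inplace == digest_after_rebind
```

Keeping a second reference such as mapped is what makes it possible to flush, or keep writing to, the file after the name arr has been rebound.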

Now part 3. It seems pretty easy to lose the reference to the memmapped object when performing complex operations. My current understanding is that you have to break the operation into steps and use arr[:] = <> for the intermediate results.

Tested with NumPy 1.8.1 and Python 3.4.1.
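One way to break the work up, sketched here under assumptions rather than taken from the tests above: since the ranking in the question runs along axis 0, each column can be ranked independently, so the result can be written block by block into a pre-allocated memmapped arr2 with slice assignment. The function name percentile_rank_by_column, the cols_per_block parameter, and the file names are hypothetical, and each block of full columns must still fit in RAM.

```python
import os
import tempfile

import numpy as np


def percentile_rank_by_column(src, dst, cols_per_block=1):
    """Write the per-column percentile rank of src into dst, block by block."""
    n_rows = src.shape[0]
    for start in range(0, src.shape[1], cols_per_block):
        stop = min(start + cols_per_block, src.shape[1])
        # Copy only this block of columns into RAM.
        block = np.array(src[:, start:stop])
        # Rank within each column; temporaries are block-sized, not array-sized.
        ranks = np.argsort(np.argsort(block, axis=0), axis=0)
        # Slice assignment writes into the mapped file and keeps dst a memmap.
        dst[:, start:stop] = ranks / float(n_rows) * 100
    dst.flush()


# Tiny demonstration with made-up data standing in for the 20 GB array.
tmp = tempfile.mkdtemp()
shape = (6, 4)
arr1 = np.memmap(os.path.join(tmp, "arr1.dat"), dtype="float64", mode="w+", shape=shape)
arr1[:] = np.random.default_rng(0).random(shape)
arr2 = np.memmap(os.path.join(tmp, "arr2.dat"), dtype="float64", mode="w+", shape=shape)

percentile_rank_by_column(arr1, arr2, cols_per_block=2)

# Reference result computed in one shot, feasible only because the demo is small.
expected = np.argsort(np.argsort(np.asarray(arr1), axis=0), axis=0) / float(len(arr1)) * 100
```

A larger cols_per_block trades memory for fewer passes over the file; the right value depends on the row count and the available RAM.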
