NumPy tofile() with very large arrays saves all zeros

When I try to save a very large array (20,000 x 20,000 elements) and read it back, I get all zeros:

In [2]: shape = (2e4,)*2

In [3]: r = np.random.randint(0, 10, shape)

In [4]: r.tofile('r.data')

In [5]: ls -lh r.data
-rw-r--r-- 1 whg staff 3.0G 23 Jul 16:18 r.data

In [6]: r[:6,:6]
Out[6]:
array([[6, 9, 8, 7, 4, 4],
       [5, 9, 5, 0, 9, 4],
       [6, 0, 9, 5, 7, 6],
       [4, 0, 8, 8, 4, 7],
       [8, 3, 3, 8, 7, 9],
       [5, 6, 1, 3, 1, 4]])

In [7]: r = np.fromfile('r.data', dtype=np.int64)

In [8]: r = r.reshape(shape)

In [9]: r[:6,:6]
Out[9]:
array([[0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0],
       [0, 0, 0, 0, 0, 0]])

np.save() does similarly strange things.
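For reference, a minimal sketch of the np.save() round trip being described (the file name is just illustrative):

import numpy as np

shape = (20000, 20000)
r = np.random.randint(0, 10, shape)

np.save('r.npy', r)        # writes a ~3 GB .npy file
r2 = np.load('r.npy')      # on the affected OSX builds this reportedly comes back wrong
print(r2[:6, :6])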

After searching around online, I found that this is a known bug on OSX:

https://github.com/numpy/numpy/issues/2806

When I instead write the data out with tostring() and try to read it back from the file using Python's read(), I get a memory error.
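A minimal sketch of that alternative attempt (assuming the raw bytes are written with tostring(); the file name is illustrative). The single read() call pulls the whole ~3 GB into memory at once, which is where it runs out:

import numpy as np

shape = (20000, 20000)
r = np.random.randint(0, 10, shape)

with open('r.raw', 'wb') as f:
    f.write(r.tostring())            # tobytes() in newer NumPy versions

with open('r.raw', 'rb') as f:
    data = f.read()                  # reads the entire ~3 GB file into memory
r2 = np.frombuffer(data, dtype=np.int64).reshape(shape)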

Is there a better way to do this? Can anyone recommend a pragmatic workaround?

1 answer

Use mmap to memory-map the file and np.frombuffer to create an array that points into that buffer. Tested on x86_64 Linux:

# `r.data` created as in the question
>>> import mmap
>>> with open('r.data') as f:
...     m = mmap.mmap(f.fileno(), 0, mmap.MAP_SHARED, mmap.PROT_READ)
...
>>> r = np.frombuffer(m, dtype='int64')
>>> r = r.reshape(shape)
>>> r[:6, :6]
array([[7, 5, 9, 5, 3, 5],
       [2, 7, 2, 6, 7, 0],
       [9, 4, 8, 2, 5, 0],
       [7, 2, 4, 6, 6, 7],
       [2, 9, 2, 2, 2, 6],
       [5, 2, 2, 6, 1, 5]])

Note that here r is a memory-mapped view of the data, which makes it more memory-efficient, but has the side effect that changes to the file's contents are picked up automatically. If you want it to point to a private copy of the data, like the array returned by np.fromfile, add r = np.copy(r).
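For example, continuing the session above (the file was mapped with PROT_READ, so the view is not writable):

>>> r.flags.writeable      # the memory-mapped view is read-only
False
>>> r = np.copy(r)         # private, writable, in-memory copy
>>> r.flags.writeable
True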

(Also, as written, this will not work under Windows, which requires slightly different mmap flags.)
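A more portable variant (a sketch, not tested on Windows) is to pass the access= argument instead of the Unix-specific flags/prot arguments, which mmap accepts on both platforms:

>>> import mmap
>>> with open('r.data', 'rb') as f:
...     m = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
...
>>> r = np.frombuffer(m, dtype=np.int64).reshape(shape)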

