Handling large dense matrices in python

Basically, what is the best way to store and work with dense matrices in Python?

I have a project that generates similarity metrics between each element in an array.

Each element is a custom class and stores a pointer to another class and a number representing its "proximity" to this class.

Currently it works great up to about 8,000 elements, after which it crashes with a MemoryError.
Basically, if you assume that each comparison takes ~30 bytes (which seems accurate based on testing) to store the similarity, then the total memory required is:
numItems^2 * itemSize = Memory
So memory usage grows quadratically with the number of elements.
In my case it's ~30 bytes per link, so:
8000 * 8000 * 30 = 1,920,000,000 bytes, or about 1.9 GB
which is right at the memory limit for a single process.
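
A quick sanity check of that formula (the helper name here is purely illustrative):

    def estimated_memory(num_items, item_size=30):
        # Full num_items x num_items matrix at item_size bytes per entry.
        return num_items ** 2 * item_size

    print(estimated_memory(8000) / 1e9)   # ~1.92 (GB), right at the limit above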

It seems to me that there should be a more efficient way to do this. I've looked at memmapping, but generating the similarity values is already computationally intensive, and bottlenecking on the hard drive on top of that seems a little ridiculous.

Edit
I've looked at numpy and scipy. Unfortunately, they cannot handle arrays this large:

    >>> np.zeros((20000,20000), dtype=np.uint16)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    MemoryError
    >>>

Further edit
Numpy seems to be the popular answer. However, numpy won't do what I want, at least not without another layer of abstraction.

I don't want to store numbers, I want to store references to classes. Numpy supports object arrays, but that doesn't really solve the array-size problem. I brought up numpy just as an example of what doesn't work.
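
For what it's worth, a numpy object array only stores one machine pointer per cell (the referenced objects live elsewhere), so the quadratic growth is untouched; a rough illustration, assuming a 64-bit build:

    import numpy as np

    n = 20000
    # dtype=object keeps one pointer per cell; the pointed-to objects are extra,
    # so the array-size problem is exactly the same as before.
    pointer_bytes = n * n * np.dtype(object).itemsize   # 8 bytes each on 64-bit
    print(pointer_bytes / 1e9, "GB")                     # ~3.2 GB of pointers alone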

Any tips?

Edit: Well, I just rewrote all of the logic so that it no longer stores redundant values, reducing memory usage from n^2 to n*(n-1)/2 stored values.

Basically, this whole thing is a version of the handshake problem, so I switched from storing every link twice to storing only one copy of each link.
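
A minimal sketch of the "store each link once" idea, using a dictionary keyed on the sorted pair of item indices; the helper names are mine, not the project's actual class layout:

    # One similarity per unordered pair: always key on (low, high).
    links = {}

    def set_link(i, j, proximity):
        if i == j:
            return                      # similarity of an item to itself is not stored
        key = (i, j) if i < j else (j, i)
        links[key] = proximity

    def get_link(i, j):
        key = (i, j) if i < j else (j, i)
        return links.get(key, 0)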

This isn't a complete solution, but I don't have any data sets large enough to overflow it, so I think it will work. PyTables is really interesting, but I don't know any SQL, and there doesn't seem to be a nice traditional slicing or index-based way of accessing the table data. I may come back to this problem in the future.

+3
6 answers

PyTables can handle tables of arbitrary size (millions of columns!) using memmap and some clever compression.

Ostensibly it provides SQL-like performance from Python. However, it will require significant code modifications.
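
For reference, a minimal sketch of what that might look like with the modern PyTables API (the file name, node name, and compression settings are my own arbitrary choices):

    import tables

    # A 20000 x 20000 uint16 array kept on disk, chunked and zlib-compressed.
    h5 = tables.open_file("similarities.h5", mode="w")
    sim = h5.create_carray(h5.root, "sim",
                           atom=tables.UInt16Atom(),
                           shape=(20000, 20000),
                           filters=tables.Filters(complevel=5, complib="zlib"))

    sim[0, 1] = 42            # cells read and write with numpy-style indexing
    print(sim[0, :10])
    h5.close()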

I won't accept this answer until I've done a more thorough check to make sure it can actually do what I want, or until someone offers a better solution.

+2

Well, I found my solution:
h5py

It's a library that provides a basically numpy-like interface but uses compressed, memmapped files to store arrays of arbitrary size (it's essentially a wrapper around HDF5).

PyTables is built on top of it, and PyTables is actually what led me to it. However, I don't need any of the SQL-like functionality that is PyTables' main selling point, and PyTables doesn't provide the clean array-like interface I was really looking for.

h5py basically acts like a numpy array and just saves the data in a different format.

It also has no restrictions on array size other than, perhaps, disk space. I'm currently testing a 100,000 * 100,000 uint16 array.
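
A minimal sketch of that h5py usage (the file name, dataset name, and chunk/compression settings are my own choices):

    import h5py

    # 100,000 x 100,000 uint16 is ~20 GB uncompressed, but it lives on disk
    # in compressed chunks rather than in RAM.
    with h5py.File("similarity.h5", "w") as f:
        sim = f.create_dataset("sim", shape=(100000, 100000), dtype="uint16",
                               chunks=(1000, 1000), compression="gzip")
        sim[0, 1] = 42        # numpy-style slicing for reads and writes
        print(sim[0, :10])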

+10

For 20,000 x 20,000, aren't you looking at 12 GB of RAM?

Won't you end up in swap-file hell trying to work with 12 GB on win32, which artificially limits the memory the OS can address?

I would be looking for an OS that can support 12 GB (32-bit Win 2003 Server can, if you need to stick with 32-bit Windows), but a 64-bit machine with a 64-bit OS and 16 GB of RAM would be a much better fit.

Good excuse for an upgrade :)

64-bit numpy can support your matrix:

    Python 2.5.2 (r252:60911, Jan 20 2010, 23:14:04)
    [GCC 4.2.4 (Ubuntu 4.2.4-1ubuntu3)] on linux2
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import numpy as np
    >>> np.zeros((20000,20000),dtype=np.uint16)
    array([[0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           ...,
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0],
           [0, 0, 0, ..., 0, 0, 0]], dtype=uint16)
+1

If you have N objects, kept in a list L, and you want to store the similarity between each object and every other object, that's O(N**2) similarities. Under the usual conditions that similarity(A, B) == similarity(B, A) and similarity(A, A) == 0, all you need is a triangular array S of similarities. The number of elements in that array will be N*(N-1)//2. You should use an array.array for this. Storing each similarity as a float will take only 8 bytes. If you can represent a similarity as an integer in range(256), you can use an unsigned byte as the array.array element type.

So that's about 8000 * 8000 / 2 * 8, i.e. about 256 MB. Using only a byte per similarity means only 32 MB. You could avoid the slow triangle-index calculation S[i*N - i*(i+1)//2 + j] by simulating a square array with S[i*N + j] instead; memory doubles (512 MB for float, 64 MB for byte).
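
A sketch of the triangular scheme with the diagonal left out (so the index formula differs slightly from the one quoted above); the helper name and N = 8000 are illustrative:

    from array import array

    N = 8000
    S = array('B', bytes(N * (N - 1) // 2))   # one unsigned byte per pair, ~32 MB

    def tri_index(i, j, n=N):
        # Flat position of the unordered pair (i, j), i != j, diagonal excluded.
        if i > j:
            i, j = j, i
        return i * n - i * (i + 1) // 2 + (j - i - 1)

    S[tri_index(3, 7)] = 200          # set similarity(3, 7)
    print(S[tri_index(7, 3)])         # symmetric lookup -> 200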

If the above doesn't suit you, then perhaps you could explain "Each element [in which container?] is a custom class and stores a pointer to another class and a number representing its 'proximity' to that class" and "I do not want to store numbers, I want to store class references". Even after replacing "class(es)" with "object(s)", I struggle to understand what you mean.

0

You might be able to reduce memory usage by using uint8, but be careful to avoid overflow errors. uint16 requires two bytes, so the minimal memory requirement in your example is 8000 * 8000 * 30 * 2 bytes = 3.84 GB.
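
The overflow warning is worth taking seriously, since uint8 arithmetic wraps around silently:

    import numpy as np

    a = np.array([200], dtype=np.uint8)
    b = np.array([100], dtype=np.uint8)
    print(a + b)      # [44] -- 300 wraps past 255 instead of raising an error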

If the second example fails, you need a new computer. Its memory requirement is 20,000 * 20,000 * 2 bytes = 800 MB.

My advice: try creating smaller matrices and use "top", "ps v", or the GNOME system monitor to check the memory used by your Python processes. Start by checking a single thread with a small matrix and do the math. Note that you can free the memory of a variable x by writing del(x); this is useful for testing.
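
A small test along those lines, with numpy's own byte count to compare against what top/ps reports (the matrix size here is just an example):

    import numpy as np

    x = np.zeros((2000, 2000), dtype=np.uint8)
    print(x.nbytes / 1e6, "MB")       # 4.0 MB for this uint8 test matrix
    del x                             # release it before trying the next size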

How much memory does your computer have? How much memory does PyTables use to create a 20,000 * 20,000 table? How much memory does numpy use to create a 20,000 * 20,000 table using uint8?

-1

Source: https://habr.com/ru/post/1479987/

