Large arrays on disk with numpy

I have a sparse array that seems too large to handle in memory (2000 x 2,500,000, float). I can build it as a scipy sparse lil_array, but if I try to convert it to compressed sparse column or row form (A.tocsc(), A.tocsr()), my machine runs out of memory. (There is also a serious discrepancy between the 4.4 GB of data in the text file and the ~12 GB assembled lil array; it would be nice to have an on-disk format closer to the size of the raw data.)
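
One way the lil-to-CSR blowup described above is commonly sidestepped (a sketch, not the asker's code; the file name, the parsing, and the triplet layout are assumptions) is to read the raw data as (row, column, value) triples and build the CSR matrix directly through COO form, skipping the lil intermediate:

```python
import numpy as np
from scipy import sparse

rows, cols, vals = [], [], []
with open("data.txt") as fh:               # hypothetical input file
    for line in fh:
        r, c, v = line.split()             # assumes "row col value" per line
        rows.append(int(r))
        cols.append(int(c))
        vals.append(float(v))

# Build CSR straight from the triplets; only the nonzeros are stored,
# and no lil intermediate is ever materialized.
A = sparse.coo_matrix(
    (np.asarray(vals, dtype=np.float64),
     (np.asarray(rows), np.asarray(cols))),
    shape=(2000, 2500000),
).tocsr()
```

The resulting CSR object can then be saved with scipy.sparse.save_npz, which keeps the on-disk size proportional to the number of nonzeros rather than the full dense shape.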

In the future, I will probably handle even larger arrays.

Question: What is the best way to handle large arrays on disk so that I can use normal numpy functions on them transparently: sums along rows and columns, vector products, max, min, slicing, and so on?

Is PyTables the way to go? Is there a good (fast) SQL-to-numpy middleware layer? Some secret disk-backed array built into numpy?
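
On the PyTables question, here is a minimal sketch (file name, node name, block size, and the random placeholder data are all made up) of a chunked, compressed HDF5 array that can be written and reduced block by block without ever holding the full matrix in RAM:

```python
import numpy as np
import tables

n_rows, n_cols, block = 2000, 2_500_000, 100_000

# Create a chunked, compressed on-disk array.
with tables.open_file("big.h5", mode="w") as f:
    atom = tables.Float64Atom()
    filters = tables.Filters(complevel=5, complib="blosc")
    carr = f.create_carray(f.root, "data", atom,
                           shape=(n_rows, n_cols), filters=filters)
    for start in range(0, n_cols, block):
        stop = min(start + block, n_cols)
        # Placeholder data; in practice this would come from the real source.
        carr[:, start:stop] = np.random.rand(n_rows, stop - start)

# Later: slice and reduce without reading the whole array.
with tables.open_file("big.h5", mode="r") as f:
    data = f.root.data
    col_sums = np.empty(n_cols)
    for start in range(0, n_cols, block):
        stop = min(start + block, n_cols)
        col_sums[start:stop] = data[:, start:stop].sum(axis=0)
```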

In the past, with (slightly smaller) arrays, I got by with caching the results of long calculations to disk and loading only the pieces I needed. That works when the arrays end up under roughly 4 GB, but it is no longer sustainable.

1 answer

I often use memory-mapped numpy arrays to handle multi-gigabyte numeric matrices, and I find they work very well for this purpose. Obviously, if the size of the data exceeds the amount of RAM, you have to be careful about access patterns to avoid thrashing.
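
As an illustration of this answer, a minimal np.memmap sketch (file name, placeholder data, and block size are arbitrary): the array lives in a plain binary file, the OS pages in only the parts that are touched, and reductions are done a block of columns at a time. Note that a dense float64 array of 2000 x 2,500,000 is about 40 GB on disk.

```python
import numpy as np

shape = (2000, 2_500_000)
block = 100_000

# Disk-backed array; use mode="r+" to reopen an existing file.
A = np.memmap("big.dat", dtype=np.float64, mode="w+", shape=shape)

row_sums = np.zeros(shape[0])
for start in range(0, shape[1], block):
    stop = min(start + block, shape[1])
    # Placeholder data; only this block needs to be resident in RAM at a time.
    A[:, start:stop] = np.random.rand(shape[0], stop - start)
    row_sums += A[:, start:stop].sum(axis=1)

A.flush()   # push any dirty pages out to disk
```

Slicing, max/min, and dot products work the same way: operate on blocks that fit in RAM and combine the partial results.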


Source: https://habr.com/ru/post/914148/

