I have a sparse array that is too large to handle efficiently in memory (2000x2500000, float). I can build it as a sparse lil_array (scipy), but if I try to convert it to a compressed sparse column or row array (A.tocsc(), A.tocsr()), my machine runs out of memory. (There is also a serious discrepancy between the 4.4 GB of data in the text file and the roughly 12 GB the assembled lil array takes; it would be nice to have an on-disk format that is closer to the size of the raw data.)
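In code, the pattern that fails is roughly this (a minimal sketch; the actual loading loop from the text file is omitted and only a couple of entries are filled in for illustration):

```python
import numpy as np
from scipy import sparse

# Build the sparse array incrementally in LIL format (fast for assignment).
A = sparse.lil_array((2000, 2500000), dtype=np.float64)

# ... in reality, populate A row by row from the 4.4 GB text file ...
A[0, 12345] = 3.14
A[1999, 2499999] = 2.71

# These conversions are where my machine runs out of memory on the real data:
B = A.tocsr()
C = A.tocsc()
```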
In the future, I will probably handle even larger arrays.
Question: what is the best way to handle large arrays on disk so that I can still use normal numpy functions transparently? For example, sums along rows and columns, vector products, max, min, slicing, etc.
Is PyTables the way to go? Is there a good (fast) SQL-to-numpy middleware layer? Some secret on-disk array built into numpy?
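To make concrete the kind of interface I am after, here is a sketch using numpy.memmap as a stand-in (the file name is made up; memmap is dense, so at this shape the backing file would be about 40 GB, which is exactly the size problem I want to avoid):

```python
import numpy as np

# An on-disk array that behaves like an in-memory ndarray.
# numpy.memmap gives this for dense data, but the dense file at this shape
# is ~40 GB, nowhere near the ~4.4 GB of raw (sparse) data.
A = np.memmap("big_array.dat", dtype=np.float64, mode="w+",
              shape=(2000, 2500000))

# The operations I would like to run transparently against an on-disk array:
row_sums = A.sum(axis=1)          # sums along rows
col_mins = A.min(axis=0)          # mins along columns
y = A.dot(np.ones(A.shape[1]))    # matrix-vector product
block = A[100:200, :1000]         # slicing
```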
In the past, with (slightly smaller) arrays, I have always just cached the results of long calculations to disk. That works as long as the arrays end up under roughly 4 GB, but it is no longer sustainable.
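The caching pattern I have relied on looks roughly like this (a sketch; the helper name and the pickle-based storage are just for illustration, and the expensive calculations live elsewhere):

```python
import pickle

def cached(path, compute):
    """Return the result pickled at `path` if present, otherwise compute and cache it."""
    try:
        with open(path, "rb") as f:
            return pickle.load(f)
    except FileNotFoundError:
        result = compute()
        with open(path, "wb") as f:
            pickle.dump(result, f, protocol=pickle.HIGHEST_PROTOCOL)
        return result

# Hypothetical usage:
# row_sums = cached("row_sums.pkl", lambda: expensive_row_sums(A))
```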