Why does reading a file into a numpy ndarray consume so much memory?

The file contains 2,000,000 lines; each line contains 208 columns separated by commas, for example:

  0.0863314058048,0.0208767447842,0.03358010485,0,0,1,0,0,0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0

The program reads this file into a numpy ndarray. I expected it to consume about 2,000,000 × 208 × 8 B ≈ 3.3 GB of memory. However, when the program reads the file, I find that it consumes about 20 GB of memory.

Why does my program consume so much more memory than I expected?
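For reference, a minimal sketch of the kind of loading step described above (the actual call is not shown in the question; data.csv is a placeholder file name):

  import numpy as np

  # Hypothetical reproduction of the loading step; both readers below
  # exhibit the memory blow-up described in this question.
  data = np.loadtxt("data.csv", delimiter=",")   # or np.genfromtxt(...)
  print(data.shape, data.nbytes)                 # (2000000, 208), ~3.3e9 bytes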

2 answers

I am using NumPy 1.9.0, and the excessive memory use of np.loadtxt() and np.genfromtxt() seems to be directly related to the fact that they build temporary Python lists to store the data (a rough illustration of that overhead follows the links below):

  • see here for np.loadtxt()
  • and here for np.genfromtxt()
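A rough illustration of the overhead (exact numbers vary by Python version; the point is the per-element cost of boxed floats and list pointers):

  import sys
  import numpy as np

  row = [float(i) for i in range(208)]   # one parsed row, as a temporary list
  # The list stores pointers, and every float is a separate boxed object.
  list_bytes = sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)
  # An ndarray stores raw 8-byte doubles back to back.
  array_bytes = np.array(row, dtype=np.float64).nbytes
  print(list_bytes, array_bytes)         # roughly 6700 vs 1664 bytes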

Knowing the shape of your array in advance, you can write a file reader that consumes an amount of memory very close to the theoretical size (about 3.3 GB in this case) by storing the data with the appropriate dtype:

  import numpy as np

  def read_large_txt(path, delimiter=None, dtype=None):
      with open(path) as f:
          # First pass: count the rows.
          nrows = sum(1 for line in f)
          f.seek(0)
          # Peek at the first line to count the columns.
          ncols = len(next(f).split(delimiter))
          # Allocate the full array up front, so no temporary lists are built.
          out = np.empty((nrows, ncols), dtype=dtype)
          f.seek(0)
          # Second pass: parse each line directly into the preallocated array.
          for i, line in enumerate(f):
              out[i] = line.split(delimiter)
      return out
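For example (data.csv is a placeholder file name; passing an explicit dtype is what pins each value to 8 bytes):

  data = read_large_txt("data.csv", delimiter=",", dtype=np.float64)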

I think you should try pandas to process big data (text files). pandas is like Excel for Python, and internally it uses numpy to represent the data.
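A minimal sketch, assuming the file has no header row (data.csv is a placeholder name):

  import numpy as np
  import pandas as pd

  # pandas' C parser is far more memory-efficient than np.loadtxt().
  df = pd.read_csv("data.csv", header=None, dtype=np.float64)
  arr = df.values   # the underlying numpy array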

HDF5 files are another way to store big data, in the binary HDF5 format.
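For example, continuing the pandas sketch above (requires the optional PyTables package; data.h5 is a placeholder name):

  # Write the DataFrame once, then reload it later without re-parsing the text.
  df.to_hdf("data.h5", key="data", mode="w")
  df = pd.read_hdf("data.h5", "data")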

This question will give some insight into how to handle large files: Big Data workflows using pandas
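In that spirit, a common pattern from those workflows is reading in chunks so that peak memory stays bounded (a sketch; the chunk size and the process function are placeholders):

  for chunk in pd.read_csv("data.csv", header=None, chunksize=100000):
      process(chunk)   # placeholder for your per-chunk work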


Source: https://habr.com/ru/post/977226/
