Why does reading a file into a numpy ndarray consume so much memory?

The file contains 2,000,000 lines; each line contains 208 columns separated by commas, for example:

  0.0863314058048,0.0208767447842,0.03358010485,0,0,1,0,0,0,0.314285714286,0.336293217457,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0

The program reads this file into a numpy ndarray. I expected it to consume about 2,000,000 × 208 × 8 B ≈ 3.3 GB of memory. However, when the program reads the file, I find that it consumes about 20 GB of memory.

Why does my program consume so much more memory than I expected?
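For reference, a minimal sketch of the kind of loading step described above (the actual call is not shown in the question; data.csv is a placeholder file name):

  import numpy as np

  # Hypothetical reproduction of the loading step; both readers below
  # exhibit the memory blow-up described in this question.
  data = np.loadtxt("data.csv", delimiter=",")   # or np.genfromtxt(...)
  print(data.shape, data.nbytes)                 # (2000000, 208), ~3.3e9 bytes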

2 answers

I am using NumPy 1.9.0, and the excessive memory use of np.loadtxt() and np.genfromtxt() seems to be directly related to the fact that they build temporary Python lists to store the data (a rough illustration of that overhead follows the links below):

  • see here for np.loadtxt()
  • and here for np.genfromtxt()
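A rough illustration of the overhead (exact numbers vary by Python version; the point is the per-element cost of boxed floats and list pointers):

  import sys
  import numpy as np

  row = [float(i) for i in range(208)]   # one parsed row, as a temporary list
  # The list stores pointers, and every float is a separate boxed object.
  list_bytes = sys.getsizeof(row) + sum(sys.getsizeof(v) for v in row)
  # An ndarray stores raw 8-byte doubles back to back.
  array_bytes = np.array(row, dtype=np.float64).nbytes
  print(list_bytes, array_bytes)         # roughly 6700 vs 1664 bytes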

Knowing the shape of your array in advance, you can write a file reader that consumes an amount of memory very close to the theoretical size (about 3.3 GB in this case) by storing the data with the appropriate dtype:

  import numpy as np

  def read_large_txt(path, delimiter=None, dtype=None):
      with open(path) as f:
          # First pass: count the rows.
          nrows = sum(1 for line in f)
          f.seek(0)
          # Peek at the first line to count the columns.
          ncols = len(next(f).split(delimiter))
          # Allocate the full array up front, so no temporary lists are built.
          out = np.empty((nrows, ncols), dtype=dtype)
          f.seek(0)
          # Second pass: parse each line directly into the preallocated array.
          for i, line in enumerate(f):
              out[i] = line.split(delimiter)
      return out
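For example (data.csv is a placeholder file name; passing an explicit dtype is what pins each value to 8 bytes):

  data = read_large_txt("data.csv", delimiter=",", dtype=np.float64)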

I think you should try pandas to process big data (text files). pandas is like Excel for Python, and internally it uses numpy to represent the data.
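A minimal sketch, assuming the file has no header row (data.csv is a placeholder name):

  import numpy as np
  import pandas as pd

  # pandas' C parser is far more memory-efficient than np.loadtxt().
  df = pd.read_csv("data.csv", header=None, dtype=np.float64)
  arr = df.values   # the underlying numpy array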

HDF5 files are another way to store big data, in the binary HDF5 format.
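For example, continuing the pandas sketch above (requires the optional PyTables package; data.h5 is a placeholder name):

  # Write the DataFrame once, then reload it later without re-parsing the text.
  df.to_hdf("data.h5", key="data", mode="w")
  df = pd.read_hdf("data.h5", "data")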

This question will give some insight into how to handle large files: Big Data workflows using pandas
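In that spirit, a common pattern from those workflows is reading in chunks so that peak memory stays bounded (a sketch; the chunk size and the process function are placeholders):

  for chunk in pd.read_csv("data.csv", header=None, chunksize=100000):
      process(chunk)   # placeholder for your per-chunk work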


Source: https://habr.com/ru/post/977226/
