Python - saving a numpy array to a file (as small as possible)

Question

Right now I have a Python program that creates a fairly large 2D numpy array and saves it as a tab-delimited text file using numpy.savetxt. The numpy array contains only floats. I then read the file one line at a time in a separate C++ program.

I would like to reduce the size of the file that I transfer between the two programs while changing my code as little as possible.

I found that I can use numpy.savetxt to save in a compressed .gz file instead of a text file. This reduces the file size from ~ 2 MB to ~ 100 kB.

Is there a better way to do this? Could I write the numpy array in binary form to save space? If so, how can I do this so that I can still read it in a C++ program?

Thanks for the help. I appreciate any advice I can get.

EDIT:

There are many zeros (perhaps 70% of the values in the numpy array are 0.0000). I'm not sure how I can take advantage of this to create a tiny file that my C++ program can still read.

+4

python numpy scipy

user1764386 Mar 12 '13 at 19:07

5 answers

Unless you are sure that you do not need to worry about endianness and the like, it is best to use numpy.savez, as described in @unutbu's answer and @jorgeca's comment on the question "numpy tostring / fromstring --- what do I need to specify to restore the array".

If the resulting size is not small enough, there is always zlib (on the Python side: import zlib; on the C++ side, I am sure an implementation exists).
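As a rough sketch of the two options above (not the asker's actual code): the array here is made-up test data with about 70% zeros, and an in-memory buffer stands in for a real file name to keep the example self-contained.

```python
import io
import zlib

import numpy as np

# Hypothetical stand-in for the asker's data: floats, roughly 70% zeros.
rng = np.random.default_rng(0)
data = rng.random((200, 300))
data[rng.random(data.shape) < 0.7] = 0.0

# Option 1: numpy's own compressed container (a zip archive of .npy files).
# In practice you would pass a file name instead of a BytesIO buffer.
buf = io.BytesIO()
np.savez_compressed(buf, data=data)
buf.seek(0)
restored = np.load(buf)["data"]

# Option 2: zlib on the raw bytes. Shape and dtype are NOT stored,
# so the reader has to know them some other way.
raw = data.astype("<f8").tobytes()  # little-endian 8-byte floats
packed = zlib.compress(raw, 9)
unpacked = np.frombuffer(zlib.decompress(packed), dtype="<f8").reshape(data.shape)

print(len(raw), len(packed), len(buf.getvalue()))
```

The second option is probably the easier one to read from C++, since the file is just a zlib stream of doubles, but the dimensions then have to be agreed on separately.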

An alternative may be to use the HDF5 format: although this does not necessarily reduce the file size on disk, it speeds up saving and loading (that is what the format was designed for: large data arrays). HDF5 readers and writers exist for both Python and C++.

+3

ev-br Mar 12 '13 at 19:44

numpy.ndarray.tofile and numpy.fromfile are useful for direct binary output / input from Python. std::ostream::write and std::istream::read are useful for binary output / input in C++.

You must be careful with endianness if the data is transferred from one machine to another.
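A minimal sketch of the Python side, assuming a small made-up array and a temporary file (the name array.bin is hypothetical). Casting to an explicit little-endian dtype is one way to pin down the byte order:

```python
import os
import tempfile

import numpy as np

a = np.arange(12, dtype=np.float64).reshape(3, 4)

path = os.path.join(tempfile.mkdtemp(), "array.bin")  # hypothetical file name

# tofile writes only the raw values in C (row-major) order: no shape, no dtype.
# Casting to '<f8' fixes the layout to little-endian 8-byte floats.
a.astype("<f8").tofile(path)

# The reader has to supply dtype and shape itself.
b = np.fromfile(path, dtype="<f8").reshape(3, 4)

print(os.path.getsize(path))  # 3 * 4 * 8 = 96 bytes
```

On the C++ side the same file would be a flat sequence of 12 doubles, so the dimensions need to be known in advance or written separately.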

+1

Dave Mar 12 '13 at 19:18

Use an HDF5 file: they are very easy to use with h5py, and you can turn on compression. Note that HDF5 also has a C++ interface.

+1

Luca fiaschi Oct 7 '13 at 14:02

If you do not mind installing additional packages (for Python and C++), you can use BSON (binary JSON).

0

shx2 Mar 12 '13 at 19:16

Roland Smith · Accepted Answer · 2013-03-12T20:46:13+0000

Since you have many zeros, you can write non-zero elements in the form (index, number).

Suppose you have an array with a small number of nonzero numbers:

    In [4]: import numpy as np

    In [5]: a = np.zeros((10, 10))

    In [6]: a
    Out[6]:
    array([[ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.],
           [ 0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.,  0.]])

    In [7]: a[3,1] = 2.0

    In [8]: a[7,4] = 17.0

    In [9]: a[9,0] = 1.5

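First, extract the interesting numbers and their indices (on Python 3, zip returns an iterator, so wrap it in list to keep the pairs around):

    In [11]: x, y = a.nonzero()

    In [12]: list(zip(x.tolist(), y.tolist()))
    Out[12]: [(3, 1), (7, 4), (9, 0)]

    In [13]: nonzero = list(zip(x.tolist(), y.tolist()))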

Now you have only a small number of data items left. The simplest thing is to write them to a text file:

    In [17]: with open('numbers.txt', 'w+') as outf:
       ....:     for r, k in nonzero:
       ....:         outf.write('{:d} {:d} {:g}\n'.format(r, k, a[r,k]))
       ....:

    In [18]: cat numbers.txt
    3 1 2
    7 4 17
    9 0 1.5

This also lets you inspect the data by eye. In your C++ program, you can read this data using fscanf.

But you can reduce the size even further by writing binary data using struct (note that the file must be opened in binary mode):

    In [19]: import struct

    In [20]: c = struct.Struct('=IId')

    In [21]: with open('numbers.bin', 'wb') as outf:
       ....:     for r, k in nonzero:
       ....:         outf.write(c.pack(r, k, a[r,k]))
       ....:

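The argument to the Struct constructor means: use native byte order with standard sizes and no alignment padding ('='). The first and second data elements are unsigned integers ('I'), the third is a double ('d'). Each record is therefore 4 + 4 + 8 = 16 bytes.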
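To check what that format string produces, here is a small sketch using only the standard library. The sample values are arbitrary.

```python
import struct

record = struct.Struct("=IId")

# '=' selects standard sizes with no padding: 4 + 4 + 8 bytes.
print(record.size)  # 16

# Pack one (row, column, value) triple and unpack it again.
packed = record.pack(3, 1, 2.0)
print(record.unpack(packed))  # (3, 1, 2.0)

# If the file may travel between machines, an explicit '<' (little-endian)
# or '>' (big-endian) prefix removes the dependence on the writer's CPU.
little = struct.Struct("<IId")
```

The record size is what the C++ reader has to match, so it is worth confirming it once on the Python side.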

In your C++ program, this data is probably best read as binary data into a packed struct.
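Before writing the C++ reader, it can help to check the file from Python. The sketch below writes the records and reads them back; it assumes a temporary file, an explicit little-endian format ('<IId' rather than the '=' used above), and that the reader already knows the array shape, since the file does not store it.

```python
import os
import struct
import tempfile

import numpy as np

a = np.zeros((10, 10))
a[3, 1] = 2.0
a[7, 4] = 17.0
a[9, 0] = 1.5

record = struct.Struct("<IId")  # assumption: little-endian on both sides
path = os.path.join(tempfile.mkdtemp(), "numbers.bin")

# Write one 16-byte record per nonzero element, in binary mode.
rows, cols = a.nonzero()
with open(path, "wb") as outf:
    for r, k in zip(rows, cols):
        outf.write(record.pack(int(r), int(k), float(a[r, k])))

# Read the records back and rebuild the dense array.
b = np.zeros((10, 10))  # shape assumed known to the reader
with open(path, "rb") as inf:
    for r, k, value in record.iter_unpack(inf.read()):
        b[r, k] = value

print(os.path.getsize(path))  # 3 records * 16 bytes = 48
```

A C++ reader would do the equivalent of the second loop: read 16 bytes at a time into a struct of two 32-bit unsigned integers and a double.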

EDIT: answer updated for 2D array.
