How to estimate a pandas DataFrame's memory size?

I was wondering ... if I read, say, a 400 megabyte CSV file into a pandas DataFrame (using read_csv or read_table), is there any way to estimate how much memory it will need? Just trying to get a better feel for dataframes and memory ...

+49
python pandas
Aug 6 '13 at 20:18
7 answers

df.memory_usage() will return how much each column takes:

 >>> df.memory_usage()
 Row_ID           20906600
 Household_ID     20906600
 Vehicle          20906600
 Calendar_Year    20906600
 Model_Year       20906600
 ...

To include the index, pass index=True .

So, to get the total memory consumption:

 >>> df.memory_usage(index=True).sum()
 731731000

In addition, passing deep=True will give a more accurate memory usage report that accounts for the full usage of the contained objects.

This is because memory usage does not include memory consumed by elements that are not components of the array if deep=False (the default).
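For illustration, here is a minimal sketch of what deep=True changes once object (string) columns are involved; the frame, column names, and sizes below are made up for this example:

 import pandas as pd

 # A small made-up frame: one numeric column and one string (object dtype) column.
 df = pd.DataFrame({
     "ints": range(10_000),
     "strs": ["some fairly long string"] * 10_000,
 })

 # Shallow report: the object column is counted as 8 bytes per row (just the pointers).
 print(df.memory_usage(index=True).sum())

 # Deep report: also counts the Python string objects the pointers refer to.
 print(df.memory_usage(index=True, deep=True).sum())

The shallow total only covers the underlying arrays; the deep total is the one that reflects what the strings actually cost.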

+33
Oct 06 '15 at 12:34

You have to do this in reverse: build the data, then measure both the file on disk and the frame in memory.

 In [4]: DataFrame(randn(1000000,20)).to_csv('test.csv')

 In [5]: !ls -ltr test.csv
 -rw-rw-r-- 1 users 399508276 Aug 6 16:55 test.csv

Technically, the memory usage of this (including the indexes) is about:

 In [16]: df.values.nbytes + df.index.nbytes + df.columns.nbytes
 Out[16]: 168000160

So about 168 MB in memory for a 400 MB file: 1M rows of 20 float columns.

 DataFrame(randn(1000000,20)).to_hdf('test.h5','df')

 !ls -ltr test.h5
 -rw-rw-r-- 1 users 168073944 Aug 6 16:57 test.h5

MUCH more compact when written as a binary HDF5 file.

 In [12]: DataFrame(randn(1000000,20)).to_hdf('test.h5','df',complevel=9,complib='blosc')

 In [13]: !ls -ltr test.h5
 -rw-rw-r-- 1 users 154727012 Aug 6 16:58 test.h5

The data was random, so compression doesn't help much.

+23
Aug 6 '13 at 21:00

I thought I would bring some more data for discussion.

I conducted a series of tests on this issue.

Using the Python resource package, I measured my process's memory usage.

And by writing the CSV into a StringIO buffer, I could easily measure its size in bytes.
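A minimal sketch of that measurement setup, assuming a Unix-like system (the resource module is not available on Windows); the frame below is illustrative rather than the exact data used in the experiments:

 import io
 import resource

 import numpy as np
 import pandas as pd

 # Illustrative frame: 100,000 rows x 10 float columns.
 df = pd.DataFrame(np.random.randn(100_000, 10))

 # Peak resident set size of this process as reported by the OS.
 # Note: ru_maxrss is in kilobytes on Linux and in bytes on macOS.
 peak_rss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

 # Size of the frame serialized as CSV, measured in memory instead of on disk.
 buf = io.StringIO()
 df.to_csv(buf)
 csv_size = len(buf.getvalue())  # characters; equals bytes for plain ASCII output

 print(peak_rss, csv_size)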

I ran two experiments, each creating 20 dataframes of increasing size, from 10,000 rows up to 1,000,000 rows. Both had 10 columns.

In the first experiment, I used only floats in my dataset.

This is how the memory usage grew compared to the CSV file size as a function of the number of rows (sizes in megabytes):

[Plot: memory and CSV size in megabytes as a function of the number of rows, float entries]

The second experiment took the same approach, but the data in the dataset consisted only of short strings.

[Plot: memory and CSV size in megabytes as a function of the number of rows, string entries]

It seems that the ratio of CSV size to dataframe size can vary quite a lot, but the size in memory was always 2-3 times larger (for the frame sizes in this experiment).

I would love to expand this answer with more experiments; please comment if you want me to try something specific.

+12
Jul 21 '15 at 15:29

If you know the dtype of your array, you can directly calculate the number of bytes needed to store your data, plus some for the Python objects themselves. A useful attribute of numpy arrays is nbytes . You can get the number of bytes from the arrays in a pandas DataFrame by doing

 nbytes = sum(block.values.nbytes for block in df.blocks.values()) 

object dtype arrays store 8 bytes per object (object dtype arrays store a pointer to an opaque PyObject ), so if you have strings in your csv, you need to take into account that read_csv will turn them into object dtype arrays and adjust your calculations accordingly.

EDIT:

For more information on object dtype, see the numpy page on scalar types. Since only a reference is stored, you also need to account for the size of the actual objects in the array. As that page says, object arrays are somewhat like Python list objects.
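One possible way to do that accounting by hand is sketched below: it adds the per-element sizes of the referenced Python objects on top of the 8-byte pointers, iterating per column instead of going through df.blocks (which newer pandas versions no longer expose). The frame and column names are made up for illustration:

 import sys

 import pandas as pd

 # Made-up frame with one int column and one object (string) column.
 df = pd.DataFrame({"id": range(1000),
                    "name": [f"example string {i}" for i in range(1000)]})

 total = 0
 for col in df.columns:
     values = df[col].values
     total += values.nbytes  # the array itself; 8 bytes per pointer for object dtype
     if values.dtype == object:
         # Add the size of each referenced Python object on top of its pointer.
         # (Objects shared between rows get counted once per reference here.)
         total += sum(sys.getsizeof(v) for v in values)

 print(total)
 # pandas' deep accounting (excluding the index, which this loop also skips)
 # does essentially the same bookkeeping, so the two numbers should be close:
 print(df.memory_usage(index=False, deep=True).sum())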

+8
Aug 6 '13 at 20:38

Yes, there is. Pandas will store your data in 2-dimensional numpy ndarray structures, grouping them by dtype. An ndarray is basically a raw C array of data with a small header, so you can estimate its size simply by multiplying the size of the dtype it contains by the dimensions of the array.

For example: if you have 1000 rows with 2 np.int32 columns and 5 np.float64 columns, your DataFrame will have one 2x1000 np.int32 array and one 5x1000 np.float64 array, which is:

4 bytes * 2 * 1000 + 8 bytes * 5 * 1000 = 48000 bytes
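That back-of-the-envelope figure can be checked directly; a small sketch with the same column layout as the example above (the column names are arbitrary):

 import numpy as np
 import pandas as pd

 # 1000 rows: 2 int32 columns and 5 float64 columns, as in the example above.
 df = pd.DataFrame({
     **{f"i{k}": np.zeros(1000, dtype=np.int32) for k in range(2)},
     **{f"f{k}": np.zeros(1000, dtype=np.float64) for k in range(5)},
 })

 # Per-column itemsize * row count: 4 * 2 * 1000 + 8 * 5 * 1000 = 48000 bytes.
 estimate = sum(df[c].dtype.itemsize * len(df) for c in df.columns)
 print(estimate)                             # 48000

 # pandas' own per-column report (excluding the index) agrees:
 print(df.memory_usage(index=False).sum())   # 48000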

+5
Aug 6 '13 at 20:30

I believe this gives the in-memory size of any object in Python. The internals need to be checked with regard to pandas and numpy:

 >>> import sys  # assuming the dataframe is named df
 >>> sys.getsizeof(df)
 59542497
+5
Nov 14 '16 at 9:18

Comparison of various methods

df is a dataframe with 814 rows, 11 columns (2 ints, 9 objects), read from a 427 KB shapefile

df.info()

 >>> df.info()
 ...
 memory usage: 70.0+ KB

 >>> df.info(memory_usage='deep')
 ...
 memory usage: 451.6 KB

df.memory_usage()

 >>> df.memory_usage()
 ...
 (lists each column at 8 bytes/row)

 >>> df.memory_usage().sum()
 71712
 (roughly rows * cols * 8 bytes)

 >>> df.memory_usage(deep=True)
 (lists each column's full memory usage)

 >>> df.memory_usage(deep=True).sum()
 462432

sys.getsizeof(df)

 >>> import sys
 >>> sys.getsizeof(df)
 462456
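To run the same comparison on your own data, something along these lines works; the frame below is a stand-in (one int column, one string column), so the numbers will differ from those above:

 import sys

 import pandas as pd

 # Stand-in frame; substitute your own df here.
 df = pd.DataFrame({"a": range(814), "b": ["some text"] * 814})

 df.info()                                # shallow estimate; the "+" hints it may undercount
 df.info(memory_usage='deep')             # inspects object columns for the real total

 print(df.memory_usage().sum())           # shallow: roughly 8 bytes per row per column
 print(df.memory_usage(deep=True).sum())  # deep: includes the string objects themselves
 print(sys.getsizeof(df))                 # the whole DataFrame object as Python reports it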
+2
Dec 11 '17 at 11:06


