Pandas uses significantly more memory than requested

I am using numpy (1.13.1) and pandas (0.20.3) on Ubuntu 16.10 with Python 2.7 or 3.5 (same problem on both).

I studied how pandas handles memory (especially when it does or does not copy data), and I am facing a serious memory problem that I don't understand. I saw (many) other questions about its memory usage, but I did not find one that directly addresses this problem.

In particular, pandas allocates far more memory than I ask for. I noticed some rather strange behavior when I simply allocate a DataFrame with one column of a given size:

import pandas as pd, numpy as np
GB = 1024**3
df = pd.DataFrame()
df['MyCol'] = np.ones(int(1*GB/8), dtype='float64')

When I do this, I see my Python process actually allocate 6 GB of memory (12 GB if I ask for 2 GB, 21 GB if I ask for 3 GB, and my computer crashes if I ask for 4 GB :-/), rather than the 1 GB I expected. At first I thought Python might be doing some aggressive preallocation; however, if I create only the numpy array, I get exactly as much memory as I request every time, be it 1 GB, 10 GB, 25 GB, whatever.
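To separate what numpy and pandas themselves account for from what the process actually takes from the operating system, a small check like the one below may help. It is only a sketch: the helper name compare_accounting is made up for illustration, and it reports the logical byte counts (arr.nbytes and DataFrame.memory_usage()), not the resident memory of the process, which has to be observed separately (for example with top).

```python
import numpy as np
import pandas as pd


def compare_accounting(n_bytes, dtype="float64"):
    """Return the byte counts numpy and pandas report for one column.

    Hypothetical helper for illustration; it does not measure process RSS.
    """
    itemsize = np.dtype(dtype).itemsize
    arr = np.ones(n_bytes // itemsize, dtype=dtype)

    # Same allocation pattern as in the question: empty frame, then assign.
    df = pd.DataFrame()
    df["MyCol"] = arr

    return {
        "requested": n_bytes,
        "numpy_nbytes": arr.nbytes,
        # index=False leaves out the index so only the column data is counted
        "pandas_column_bytes": int(df.memory_usage(index=False)["MyCol"]),
    }


MB = 1024 ** 2
print(compare_accounting(8 * MB))
```

If both reported numbers match the requested size while the process footprint is several times larger, the extra memory is not something pandas attributes to the column itself, which narrows down where to look.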

Also, more interestingly, if I change the code a bit to:

df['MyCol'] = np.ones(int(1*GB), dtype='uint8')

, ( numpy 1 ). ( 2017/8/17. pandas (0.20.3) numpy (1.13.1), 64 . , 64 (ish) GB .)

, pandas , , , , . .

, . , , , - , , , , .

?

:

  • , pandas () memory_usage() (.. 1 , 1 , 6-10 ).
  • ( memory_usage(), ).
  • pandas DataFrame (df = None, gc.collect()) . .

If I make an array of 1000 floats, which is 8000 bytes:

In [248]: x=np.ones(1000)

In [249]: df=pd.DataFrame({'MyCol': x}, dtype=float)
In [250]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 1 columns):
MyCol    1000 non-null float64
dtypes: float64(1)
memory usage: 15.6 KB

That is 8k for the data and 8k for the index.

Adding another column takes up another 8k, the size of x:

In [251]: df['col2']=x
In [252]: df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1000 entries, 0 to 999
Data columns (total 2 columns):
MyCol    1000 non-null float64
col2     1000 non-null float64
dtypes: float64(2)
memory usage: 23.4 KB

In [253]: x.nbytes
Out[253]: 8000
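The two "memory usage" figures printed by df.info() above are consistent with 8 bytes per entry for the Int64Index plus 8 bytes per float64 value. A quick arithmetic check, assuming df.info() reports sizes in units of 1024 bytes (the function name expected_kb is just for illustration):

```python
ROWS = 1000
BYTES_PER_VALUE = 8  # int64 index entries and float64 values are both 8 bytes


def expected_kb(n_columns, rows=ROWS):
    """Index plus n_columns float64 columns, expressed in KB (1024 bytes)."""
    total_bytes = rows * BYTES_PER_VALUE * (1 + n_columns)
    return total_bytes / 1024.0


print(expected_kb(1))  # 15.625, shown by df.info() as 15.6 KB
print(expected_kb(2))  # 23.4375, shown by df.info() as 23.4 KB
```

So at this scale the frame's own accounting grows by exactly x.nbytes per added column, with no hidden multiple.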

Source: https://habr.com/ru/post/1666283/

