I am using numpy (1.13.1) and pandas (0.20.3) on Ubuntu 16.10 with python 2.7 or 3.5 (same problems on both).
I have been studying pandas memory handling (specifically when it copies or does not copy data) and am facing a serious memory problem that I don't understand. While I have seen (many) other questions related to its memory usage, I did not find one that directly addresses this problem.
In particular, pandas allocates a lot more memory than I ask for. I noticed rather strange behavior when I simply try to allocate a DataFrame with a column of a given size:
import pandas as pd, numpy as np
GB = 1024**3
df = pd.DataFrame()
df['MyCol'] = np.ones(int(1*GB/8), dtype='float64')
When I do this, I see that my python process actually allocates 6 GB of memory (12 GB if I ask for 2 GB, 21 GB if I ask for 3 GB, and my computer crashes if I ask for 4 GB :-/), instead of the 1 GB I expected. At first I thought that maybe Python does some aggressive preallocation, but if I only create the numpy array itself, I get exactly as much memory as I request every time, be it 1 GB, 10 GB, 25 GB, whatever.
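Roughly how I read the numbers above (a minimal sketch; I use psutil here purely for the programmatic readout, but top/htop shows the same figures):
import os
import numpy as np
import pandas as pd
import psutil  # only used to read the process RSS; top/htop shows the same numbers

GB = 1024**3
proc = psutil.Process(os.getpid())

def rss_gb():
    # resident set size of this python process, in GB
    return proc.memory_info().rss / float(GB)

print('baseline RSS:            %.2f GB' % rss_gb())

arr = np.ones(int(1*GB/8), dtype='float64')            # exactly 1 GB of float64 data
print('after numpy allocation:  %.2f GB' % rss_gb())   # grows by ~1 GB, as expected

df = pd.DataFrame()
df['MyCol'] = arr
print('after pandas assignment: %.2f GB' % rss_gb())   # this is where I see ~6 GB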
Even more interestingly, if I change the code slightly:
df['MyCol'] = np.ones(int(1*GB), dtype='uint8')
it allocates so much memory that it crashes my computer (running the numpy call on its own allocates exactly 1 GB, as expected). (Edit 2017/8/17: I tried this again with pandas (0.20.3) and numpy (1.13.1) on a machine with 64 GB of memory; it consumed around 64 (ish) GB before dying.)
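Here is a scaled-down sketch of the uint8 case (100 MB instead of 1 GB, so it can't take the machine down) that checks what the column actually ends up as; even an upcast from uint8 to float64 would only account for an 8x increase, so I don't think dtype alone explains what I see:
import numpy as np
import pandas as pd

MB = 1024**2
arr = np.ones(int(100*MB), dtype='uint8')   # 100 MB of uint8
print(arr.nbytes)                           # 104857600 bytes, as expected

df = pd.DataFrame()
df['MyCol'] = arr
print(df['MyCol'].dtype)                    # did pandas keep uint8, or upcast it?
print(df.memory_usage(deep=True))           # what pandas itself claims the column takes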
I could understand pandas making a copy of the data it is given, which would double (or even triple) the memory required, but I cannot explain why a 1 GB column ends up costing 6 GB. That is the part I don't understand.
The behavior is not even entirely consistent. Sometimes the allocation comes out close to what I expect, and sometimes the surprising amount of memory is allocated and then released again - but often it simply stays allocated, which makes me question whether pandas is suitable for large data sets at all, since even a relatively small DataFrame can run the machine out of memory.
Can anyone explain this behavior?
Other things I have tried (a consolidated measurement sketch follows the list):
- Checking pandas' own (documented) accounting with memory_usage() reports the expected size (i.e. if I ask for 1 GB it reports 1 GB, even though 6-10 GB have actually been allocated).
- Writing the DataFrame out to a file produces a file of the expected size (matching memory_usage(), not the much larger amount actually allocated).
- Releasing the pandas DataFrame (df = None, gc.collect()) does not release the memory back to the system. The memory is only freed when the python process exits.
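The consolidated sketch of those checks mentioned above (psutil is again only there for the RSS readout; the exact numbers printed will of course depend on the machine):
import gc
import os
import numpy as np
import pandas as pd
import psutil  # only used to read the process RSS

GB = 1024**3
proc = psutil.Process(os.getpid())
rss_gb = lambda: proc.memory_info().rss / float(GB)

print('baseline RSS:                       %.2f GB' % rss_gb())

df = pd.DataFrame()
df['MyCol'] = np.ones(int(1*GB/8), dtype='float64')
print('RSS after assignment:               %.2f GB' % rss_gb())   # several GB on my machine
print('pandas memory_usage():              %.2f GB' % (df.memory_usage().sum() / float(GB)))  # reports ~1 GB

# try to give the memory back
df = None
gc.collect()
print('RSS after df = None + gc.collect(): %.2f GB' % rss_gb())   # not released for me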