I've spent the day trying to debug a memory issue in my Python script. I'm using SQLAlchemy as my ORM. There are a few confounding issues here, and I'm hoping that if I list them all, someone can point me in the right direction.
To get the performance I'm looking for, I read in all the records from a lookup table (~400k), then iterate through a spreadsheet, look each row up against the records I read earlier, and create new records (~800k) in another table. Here's what the code looks like:
dimensionMap = {}
for d in connection.session.query(Dimension):
    dimensionMap[d.businessKey] = d.primarySyntheticKey
# len(dimensionMap) == ~400k, sys.getsizeof(dimensionMap) == ~4MB

allfacts = []
sheet = open_spreadsheet(path)
for row in sheet.allrows():
    dimensionId = dimensionMap[row[0]]
    metric = row[1]

    fact = Fact(dimensionId, metric)
    connection.session.add(fact)
    allfacts.append(fact)

    if row.number % 20000 == 0:
        connection.session.flush()
# len(allfacts) == ~800k, sys.getsizeof(allfacts) == ~50MB

connection.session.commit()

sys.stdout.write('All Done')
400k and 800k don't seem like particularly large numbers to me, but I'm nonetheless running into memory problems on a machine with 4 GB of memory. This is really strange to me, because I ran sys.getsizeof on my two largest collections, and they were both well under any size that should cause problems.
While trying to figure this out, I noticed that the script was running really, really slowly. So I ran a profiler on it, hoping the results would point me toward the memory problem, and came up with two confusing issues.

First, 87% of the program's time is spent flushing, specifically on this line of code:
self.transaction._new[state] = True
This is in session.py:1367. self.transaction._new is an instance of weakref.WeakKeyDictionary(). Why does weakref:261:__setitem__ take so long?
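For anyone unfamiliar with WeakKeyDictionary, here is my own reduced illustration of what that structure does (not SQLAlchemy's code): every insert wraps the key in a weak reference with a removal callback, and the entry disappears as soon as the key itself is garbage collected.

import weakref

class State(object):
    """Stand-in for the per-instance state objects the session tracks."""
    pass

pending = weakref.WeakKeyDictionary()

s = State()
pending[s] = True        # each insert builds a weakref plus a removal callback
print(len(pending))      # 1

del s                    # in CPython the key is collected right away...
print(len(pending))      # 0, the entry vanished along with the key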
Second, even after the program finishes ("All Done" is printed to stdout), the script keeps running, seemingly forever, with 2.2 GB of memory in use.
I've done some searching on weakrefs, but haven't seen anyone mention the performance issues I'm running into. Ultimately there isn't much I can do about this, since it's buried deep inside SQLAlchemy, but I'd still appreciate any ideas.
Main lessons
As @zzzeek already mentioned, maintaining persistent objects requires a lot of memory overhead. Here's a small chart showing the growth.

The trend line suggests that each persistent instance takes about 2 KB of memory overhead, even though the instance itself is only 30 bytes. That brings me to the other thing I learned, which is that sys.getsizeof should be taken with a huge grain of salt.
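I no longer have the raw numbers behind the chart, but something along these lines is how you can reproduce that kind of measurement yourself. This is only a sketch using the stdlib tracemalloc module and a plain class as a stand-in; to measure the real ORM overhead you would create mapped instances and add them to a session instead.

import tracemalloc

class Fact(object):
    """Plain stand-in; a mapped, session-tracked instance costs far more."""
    def __init__(self, dimension_id, metric):
        self.dimension_id = dimension_id
        self.metric = metric

N = 100000
tracemalloc.start()
before, _ = tracemalloc.get_traced_memory()

facts = [Fact(i, i * 1.5) for i in range(N)]

after, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

print("approx. bytes per instance: %d" % ((after - before) // N))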
This function only returns the shallow size of an object, and doesn't take into account any other objects that need to exist for the first object to make sense (__dict__, for example). You really need to use something like Heapy to get a good idea of the actual memory footprint of an instance.
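To make the shallow-size point concrete, here's a quick illustration of my own (sized to roughly match the dimensionMap above): sys.getsizeof on a dict reports only the hash table itself, not the keys and values it points to.

import sys

d = {i: 'value-%d' % i for i in range(400000)}

shallow = sys.getsizeof(d)
deep = shallow + sum(sys.getsizeof(k) + sys.getsizeof(v) for k, v in d.items())

print("shallow: %.1f MB" % (shallow / 1e6))   # just the hash table
print("deeper:  %.1f MB" % (deep / 1e6))      # the keys and values add substantially more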
The last thing I learned is that when Python is on the verge of running out of memory and is thrashing like crazy, strange things happen that shouldn't be taken as part of the problem. In my case, the massive slowdown, the profile pointing at the weakref creation, and the hang after the program completed were all effects of the memory issue. Once I stopped creating and keeping persistent instances around, and instead kept just the object properties I needed, all the other problems went away.
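For completeness, the reworked loop looked roughly like this. It's only a sketch, reusing the same hypothetical open_spreadsheet/connection helpers from above; the column names in the insert are assumptions, and on newer SQLAlchemy versions you could use Session.bulk_insert_mappings instead.

# Keep plain dicts of just the values I need, instead of 800k persistent
# ORM instances that the session has to track.
rows = []
sheet = open_spreadsheet(path)
for row in sheet.allrows():
    rows.append({
        'dimension_id': dimensionMap[row[0]],   # assumed column names
        'metric': row[1],
    })
    if len(rows) >= 20000:
        connection.session.execute(Fact.__table__.insert(), rows)
        rows = []

if rows:
    connection.session.execute(Fact.__table__.insert(), rows)

connection.session.commit()
sys.stdout.write('All Done')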