Very poor performance and memory problems in Python / SQLAlchemy

I spent the day trying to debug a memory issue in my Python script. I'm using SQLAlchemy as my ORM. There are a few confusing things going on here, and I'm hoping that if I list them all, someone can point me in the right direction.

In order to get the performance I'm looking for, I read in all the records of a table (~400k), then loop through a spreadsheet, match each row against the records I previously read in, and create new records (~800k) in another table. Here's roughly what the code looks like:

    dimensionMap = {}
    for d in connection.session.query(Dimension):
        dimensionMap[d.businessKey] = d.primarySyntheticKey
    # len(dimensionMap) == ~400k, sys.getsizeof(dimensionMap) == ~4MB

    allfacts = []
    sheet = open_spreadsheet(path)
    for row in sheet.allrows():
        dimensionId = dimensionMap[row[0]]
        metric = row[1]
        fact = Fact(dimensionId, metric)
        connection.session.add(fact)
        allfacts.append(fact)
        if row.number % 20000 == 0:
            connection.session.flush()
    # len(allfacts) == ~800k, sys.getsizeof(allfacts) == ~50MB

    connection.session.commit()
    sys.stdout.write('All Done')

400k and 800k don't seem like particularly large numbers to me, but I'm nonetheless running into memory problems on a machine with 4 GB of memory. This is really strange to me, since I ran sys.getsizeof on my two largest collections and both were well under any size that should cause problems.

While trying to figure this out, I noticed that the script was running really, really slowly. So I ran a profiler on it, hoping the results would point me toward the memory problem, and instead came up with two confusing issues.

Profiler output

First, 87% of the program's time is spent in the flush, in particular on this line of code:

 self.transaction._new[state] = True 

This is in session.py:1367. self.transaction._new is an instance of weakref.WeakKeyDictionary(). Why does weakref:261:__setitem__ take so long?
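For comparison, inserting into a WeakKeyDictionary is normally cheap on its own. Here is a minimal, self-contained timing sketch (the Thing class and the 100k count are made up purely for illustration):

    import timeit
    import weakref

    class Thing:
        # WeakKeyDictionary keys must be weak-referenceable, so a plain class is used
        pass

    wk = weakref.WeakKeyDictionary()
    keys = [Thing() for _ in range(100000)]  # strong refs keep the entries alive

    def fill():
        for k in keys:
            wk[k] = True

    print(timeit.timeit(fill, number=1))  # typically a small fraction of a second for 100k inserts

which is part of why, as it turned out below, the weakref line was a symptom rather than the cause.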

Second, even after the program is done ("All Done" is printed to stdout), the script keeps going, seemingly forever, with 2.2 GB of memory in use.

I've done some searching on weakrefs, but I haven't seen anyone mention the performance problems I'm running into. Ultimately, there isn't much I can do about this, given that it's buried deep inside SQLAlchemy, but I'd still appreciate any ideas.

Main lessons learned

As @zzzeek mentioned, maintaining persistent objects carries a lot of memory overhead. Here's a small graph to show the growth.

[Graph: total memory used vs. number of persistent instances]

The trend line suggests that each persistent instance takes about 2 KB of memory overhead, even though the instance itself is only 30 bytes. This brings me to another thing I learned, which is that sys.getsizeof should be taken with a huge grain of salt.

This function only returns the shallow size of an object, and doesn't take into account any other objects that have to exist for the first object to make sense (its __dict__, for example). You really need to use something like Heapy to get a good picture of the actual memory footprint of an instance.
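As a quick illustration of the difference (a made-up Point class, and a deliberately rough recursive sizer that only approximates what a real tool like Heapy reports):

    import sys

    class Point:
        def __init__(self, x, y):
            self.x = x
            self.y = y

    p = Point(1.0, 2.0)
    print(sys.getsizeof(p))           # shallow size of the instance itself
    print(sys.getsizeof(p.__dict__))  # the per-instance attribute dict is counted separately

    def deep_size(obj, seen=None):
        # very rough recursive estimate; Heapy/pympler are far more accurate
        seen = seen if seen is not None else set()
        if id(obj) in seen:
            return 0
        seen.add(id(obj))
        size = sys.getsizeof(obj)
        if isinstance(obj, dict):
            size += sum(deep_size(k, seen) + deep_size(v, seen) for k, v in obj.items())
        elif hasattr(obj, '__dict__'):
            size += deep_size(obj.__dict__, seen)
        return size

    print(deep_size(p))  # includes __dict__ and the attribute values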

The last thing I learned is that when Python is on the verge of running out of memory and thrashing like crazy, weird things happen that shouldn't be treated as part of the problem itself. In my case, the massive slowdown, the profile pointing at the weakref creation, and the hang after the program completed were all effects of the memory problem. Once I stopped creating and keeping around persistent instances, and instead just kept the object properties I actually needed, all the other problems went away.
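Concretely, that change looked roughly like this (a sketch only, reusing the dimensionMap/sheet names from the code above; the exact code isn't reproduced here):

    # instead of building allfacts = [Fact(...), ...] and session.add()-ing each one,
    # keep only the plain values that are needed later and bulk-insert them
    fact_rows = []
    for row in sheet.allrows():
        fact_rows.append((dimensionMap[row[0]], row[1]))  # plain tuples, no per-object ORM state
    # fact_rows can then be inserted without the Session tracking ~800k identities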

1 answer

800K ORM objects is very large. These are Python objects, each of which has a __dict__ as well as an _sa_instance_state attribute which is itself an object, which then has weakrefs and other things inside it; the Session then holds more than one weakref to your object as well. An ORM object is identity-tracked, a feature that provides a high degree of automation in persistence, but at the cost of a lot more memory and function call overhead.
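To see where that per-instance overhead lives, you can poke at a single mapped object. A self-contained sketch with a toy model (the Widget class is made up for illustration and has nothing to do with the question's schema; it assumes SQLAlchemy 1.4+, newer than what the question used, but the overhead picture is the same):

    import sys
    from sqlalchemy import Column, Integer, String, create_engine, inspect
    from sqlalchemy.orm import Session, declarative_base

    Base = declarative_base()

    class Widget(Base):               # toy model for illustration only
        __tablename__ = 'widget'
        id = Column(Integer, primary_key=True)
        name = Column(String)

    engine = create_engine('sqlite://')
    Base.metadata.create_all(engine)

    w = Widget(name='x')
    print(sys.getsizeof(w))           # shallow size of the instance only
    print(sys.getsizeof(w.__dict__))  # the per-instance attribute dict
    print(inspect(w))                 # the InstanceState stored as w._sa_instance_state

    with Session(engine) as session:
        session.add(w)
        session.flush()
        print(w in session)           # the Session's identity map now also references w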

As far as why your profiling is centered on that one weakref line, that seems very strange; I'd be curious to see the actual profile result there (see How can I profile a SQLAlchemy-powered application? for background).
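If it helps, a bare-bones profiling harness along those lines can be as simple as the standard library's cProfile (a generic sketch, not the exact setup from either question; run_load is a placeholder for the load loop):

    import cProfile
    import pstats

    def run_load():
        ...  # the spreadsheet/ORM load loop goes here

    profiler = cProfile.Profile()
    profiler.enable()
    run_load()
    profiler.disable()

    pstats.Stats(profiler).sort_stats('cumulative').print_stats(25)  # top 25 entries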

Your example code can be modified to not use any ORM identity-mapped objects, as follows. For more detail on bulk inserts, see Why is SQLAlchemy insert with sqlite 25 times slower than using sqlite3 directly?.

    # 1. only load individual columns - loading simple tuples instead
    # of full ORM objects with identity tracking.  these tuples can be
    # used directly in a dict constructor
    dimensionMap = dict(
        connection.session.query(
            Dimension.businessKey, Dimension.primarySyntheticKey
        )
    )

    # 2. for bulk inserts, use a Table.insert() call with
    # multiparams, in chunks
    buf = []
    for row in sheet.allrows():
        dimensionId = dimensionMap[row[0]]
        metric = row[1]
        buf.append({"dimensionId": dimensionId, "metric": metric})
        if len(buf) == 20000:
            connection.session.execute(Fact.__table__.insert(), params=buf)
            buf[:] = []

    connection.session.execute(Fact.__table__.insert(), params=buf)
    sys.stdout.write('All Done')
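Passing a list of parameter dictionaries to a single Table.insert() lets each chunk go to the DBAPI as an executemany()-style call, so no ORM instances, identity-map entries, or weakrefs are created for the ~800k inserted rows; only the plain dimensionMap tuples and the current 20k-row buffer are held in memory at once.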

Source: https://habr.com/ru/post/954821/

