Is this a memory leak? (Python program with SQLAlchemy/SQLite)

The following code iterates over a large data set (2M rows). It eats up all 4 GB of my RAM before it finishes.

    for sample in session.query(CodeSample).yield_per(100):
        for proj in projects:
            if sample.filename.startswith(proj.abs_source):
                sample.filename = "some other path"
                session.add(sample)

Then I ran it on a reduced data set and analyzed the heap with heapy. get_rp() gave me the following hint:

    0: _ --- [-] 47821 (0x9163aec | 0x9165fec | 0x916d6cc | 0x9251414 | 0x925704...
    1: a [-] 8244 tuple: 0x903ec8c*37, 0x903fcfc*13, 0x9052ecc*46...
    2: aa ---- [S] 3446 types.CodeType: parseresult.py:73:src_path...
    3: ab [S] 364 type: __builtin__.Struct, _random.Random, sqlite3.Cache...
    4: ac ---- [-] 90 sqlalchemy.sql.visitors.VisitableType: 0x9162f2c...
    5: aca [S] 11 dict of module: ..sql..., codemodel, sqlalchemy
    6: acb ---- [-] 48 sqlalchemy.sql.visitors.VisitableType: 0x9162f2c...
    7: acba [S] 9 dict of module: ..sql..., codemodel, sqlalchemy
    8: acbb ---- [-] 45 sqlalchemy.sql.visitors.VisitableType: 0x9165fec...
    9: acbba [S] 8 dict of module: ..sql..., codemodel, sqlalchemy
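For reference, a dump like this can be produced with guppy/heapy roughly as follows (a sketch; hpy(), setrelheap(), heap() and get_rp() are guppy's actual API, but where exactly the snapshot is taken around the loop is an assumption):

    from guppy import hpy

    h = hpy()
    h.setrelheap()        # only count objects allocated after this point
    # ... run the query loop from above ...
    heap = h.heap()       # snapshot of live objects
    print(heap)           # size breakdown by type
    print(heap.get_rp())  # reference pattern, as in the dump above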

I am new to SQLAlchemy. Is this a memory leak? Thanks.

2 answers

Most DBAPIs, including psycopg2 and mysql-python, fully load all result rows into memory before releasing them to the client. SQLAlchemy's yield_per() option does not work around this, with the exceptions noted below, so it is not a very useful option in general (edit: it is useful in the sense that it starts passing results back before all the rows have been fully processed).

Exceptions to this behavior are:

  • Using a DBAPI that does not buffer rows. cx_oracle is one, as a result of the natural workings of OCI. I am not sure about the behavior of pg8000, and there is a new MySQL DBAPI called OurSQL which, as its creator told me, does not buffer rows. pg8000 and OurSQL are supported by SQLAlchemy 0.6.
  • With psycopg2, a server-side cursor can be used. SQLAlchemy supports the create_engine() flag server_side_cursors=True, which uses server-side cursors for all row-selecting operations. However, since server-side cursors are generally expensive and thus hurt performance for small queries, SQLAlchemy 0.6 now supports the psycopg2 server-side cursor on a per-statement or per-query basis via .execution_options(stream_results=True), where execution_options() is available on Query, select(), text() and Connection. The Query object applies this option when yield_per() is used, so in 0.6, yield_per() combined with psycopg2 is actually useful.
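Putting that together, a minimal sketch of the two approaches (the connection URL and per-row work are placeholders, and CodeSample is the model from the question; create_engine(), execution_options(stream_results=True) and yield_per() are the SQLAlchemy calls described above):

    from sqlalchemy import create_engine
    from sqlalchemy.orm import sessionmaker

    # Engine-wide: server-side cursors for all SELECTs (psycopg2 only)
    engine = create_engine("postgresql://user:pw@localhost/mydb",
                           server_side_cursors=True)

    Session = sessionmaker(bind=engine)
    session = Session()

    # Per-query streaming (SQLAlchemy 0.6+); with psycopg2, yield_per()
    # applies this option by itself
    query = session.query(CodeSample).execution_options(stream_results=True)
    for sample in query.yield_per(100):
        pass  # ... per-row work; rows arrive from the server in chunks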

The session keeps track of all CodeSample objects that you retrieve. So after iterating over 2M objects, the session holds a reference to all of them. The session needs these references so that it can write the correct changes to the database on flush. Therefore, I believe this is what you should expect.
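You can watch this happen via the session's identity map (a sketch; session.identity_map is real SQLAlchemy API, and the loop mirrors the one from the question):

    # Even with yield_per(), the identity map grows with every row fetched
    for i, sample in enumerate(session.query(CodeSample).yield_per(100)):
        sample.filename = "some other path"   # pending change pins the object
        if i % 100000 == 0:
            print(len(session.identity_map))  # grows until the next flush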

To keep only N objects in memory at a time, you can do something like the code below (inspired by this answer; disclaimer: I have not tested it).

    offset = 0
    N = 10000
    got_rows = True
    while got_rows:
        got_rows = False
        for sample in session.query(CodeSample).limit(N).offset(offset):
            got_rows = True
            for proj in projects:
                if sample.filename.startswith(proj.abs_source):
                    sample.filename = "some other path"
        offset += N
        session.flush()        # writes changes to DB
        session.expunge_all()  # removes objects from session

But the above is a bit clumsy; maybe some SQLAlchemy gurus know a better way to do this.

By the way, you should not need session.add(); the session tracks changes to the objects it manages. Why are you using yield_per()? (EDIT: I guess it is needed to fetch rows from the database in chunks, is that correct? In any case, the session will keep track of all of them.)
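To illustrate the point about add(): objects loaded by a query are already managed, and a plain attribute assignment marks them dirty (a sketch; session.dirty and flush() are real SQLAlchemy API, CodeSample is the model from the question):

    sample = session.query(CodeSample).first()
    sample.filename = "some other path"  # no session.add() needed
    print(sample in session.dirty)       # True: the change is already tracked
    session.flush()                      # emits the UPDATE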

EDIT:

Hmm, it seems I had misunderstood something. From the docs:

weak_identity_map: When set to the default of True, a weak-referencing map is used; instances that are not referenced externally will be garbage collected immediately. For dereferenced instances that have pending changes, the attribute management system creates a temporary strong reference to the object, which lasts until the changes are flushed to the database, at which point the object is dereferenced again. Alternatively, when using the value False, the identity map uses a regular Python dictionary to store instances. The session will maintain all instances present until they are removed using expunge(), clear(), or purge().

and

prune(): Remove unreferenced instances cached in the identity map.

Note that this method is only meaningful if "weak_identity_map" is set to False. The default weak identity map is self-pruning.

Removes any object in this Session's identity map that is not referenced in user code, modified, new, or scheduled for deletion. Returns the number of objects pruned.
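In other words, the default weak identity map lets clean, unreferenced instances be garbage collected on its own, while a strong-referencing session holds on to everything until pruned. A sketch of both configurations (weak_identity_map and prune() are the 0.5/0.6-era APIs quoted above and were removed in later SQLAlchemy releases; engine and CodeSample are as before):

    from sqlalchemy.orm import sessionmaker

    # Default: weak identity map; clean, unreferenced instances are GC'ed
    Session = sessionmaker(bind=engine)

    # Strong-referencing map: everything stays until expunged or pruned
    StrongSession = sessionmaker(bind=engine, weak_identity_map=False)
    session = StrongSession()
    for sample in session.query(CodeSample).yield_per(100):
        pass                # ... read-only work ...
    print(session.prune())  # releases clean, unreferenced instances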


Source: https://habr.com/ru/post/1299513/
