Efficient data migration on a large Django table

I need to add a new column to a large (5 million row) Django table. I have a South schemamigration that creates the new column. Now I am writing a datamigration script to populate the new column. It looks like this. (If you are not familiar with South migrations, just ignore the orm. prefix on the model name.)

 print "Migrating %s articles." % orm.Article.objects.count() cnt = 0 for article in orm.Article.objects.iterator(): if cnt % 500 == 0: print " %s done so far" % cnt # article.newfield = calculate_newfield(article) article.save() cnt += 1 

I switched from objects.all to objects.iterator to reduce memory requirements, but something is still chewing up huge amounts of memory when I run this script. Even with the genuinely useful line commented out as above, the script still grows to use 10+ GB of RAM before getting very far through the table, and I give up.

Something seems to be holding onto these objects in memory. How can I run this so that it is not essentially an in-memory operation?

FWIW, I am using Python 2.6, Django 1.2.1, South 0.7.2, MySQL 5.1.

5 answers

Make sure settings.DEBUG is set to False. DEBUG=True fills up memory, especially during database-intensive operations, since it stores all the queries sent to the RDBMS.

With Django 1.8 out, this should no longer be necessary, since a hard-coded maximum of 9000 queries is now stored, instead of an unbounded number as before.
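If for some reason the migration has to run with DEBUG=True, a common workaround is to clear the stored query log periodically. This is only a sketch layered onto the question's loop; django.db.reset_queries() is the relevant helper:

    # Sketch: periodically clear the query log that DEBUG=True accumulates.
    # The loop mirrors the question's migration code; reset_queries() simply
    # empties the stored query list so it cannot grow without bound.
    from django import db

    cnt = 0
    for article in orm.Article.objects.iterator():
        # article.newfield = calculate_newfield(article)
        article.save()
        cnt += 1
        if cnt % 500 == 0:
            db.reset_queries()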


Welcome to the Django ORM. I think this is an inherent problem.

I also had problems with large databases, dumpdata, loaddata, etc.

You have two options.

  • Don't bother with South; write your own one-off ORM migration. You can have several database definitions in your settings: create "old" and "new", and write your own one-time migrator from the old database to the new one (a sketch follows after this list). Once it is tested and working, run it one final time, then switch the database definitions and restart Django.

  • Drop South and the ORM and write your own SQL migration. Use raw SQL to copy data from the old structure to the new structure. Debug it separately. When it's good, run it one final time, then switch the settings and restart Django.

It's not that South or the ORM is particularly bad, but for bulk processing on large databases they cache too much in memory.
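A minimal sketch of the first option, assuming settings.DATABASES defines "old" and "new" aliases and reusing the question's calculate_newfield() helper; the import path and batch size are illustrative, not prescribed by this answer:

    # Sketch of a standalone one-off migrator between two database aliases.
    # Assumes both databases already contain the Article table (the "new" one
    # with the extra column) and that this runs outside South entirely.
    from myapp.models import Article  # illustrative import path

    def migrate_articles(batch=500):
        for i, article in enumerate(Article.objects.using('old').iterator(), 1):
            # calculate_newfield() is the helper from the question
            article.newfield = calculate_newfield(article)
            article.save(using='new')  # write the row into the new database
            if i % batch == 0:
                print "%s done so far" % i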


orm.Article.objects.iterator()

Does this run the entire query and keep the result set in memory, or does it fetch rows from the database one at a time?

I guess it does it all at once. See if you can replace that loop with a raw database cursor that fetches the data incrementally:

for example: http://docs.python.org/library/sqlite3.html#sqlite3.Cursor.fetchmany

    db = blah.connect("host='%s' dbname='%s' user='%s' password='%s'" % ...)
    new, old = db.cursor(), db.cursor()
    old.execute("""
        SELECT * FROM whatever
    """)
    for row in old.fetchmany(size=500):
        (col1, col2, col3, ...) = row
        new.execute("""
            INSERT INTO yourtable (col1, col2, col3, ...)
            VALUES (%s, %s, %s, ...)
        """, (col1, col2, col3, ...))
    new.close()
    old.close()

It will be slow. I pulled this from an offline migration script of mine, so YMMV.

fetchmany is standard (PEP 249). I haven't done exactly what you are looking for, so there is still a bit of work to do on top of this pattern: the snippet does not loop over successive batches of 500 until it reaches the end of the table, so you will need to add that yourself.
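As a hedged sketch of that missing piece, assuming a PEP 249 connection named db and illustrative table and column names (none of them from the original answer), the batch loop could look like this:

    # Sketch only: keep calling fetchmany() until the cursor is exhausted,
    # committing after each batch so the transaction stays small.
    old, new = db.cursor(), db.cursor()
    old.execute("SELECT id, some_col FROM old_table")
    while True:
        rows = old.fetchmany(500)
        if not rows:
            break
        for (row_id, some_col) in rows:
            new.execute(
                "INSERT INTO new_table (id, some_col) VALUES (%s, %s)",
                (row_id, some_col))
        db.commit()
    new.close()
    old.close()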


Alternatively, what happens if you create a raw query in situ that implements a rudimentary limit on the result-set size?

a la: https://docs.djangoproject.com/en/1.3/topics/db/sql/#index-lookups

    min_id = 0
    rowcount = Article.objects.count()
    while min_id < rowcount:
        max_id = min_id + 500
        articles = Article.objects.raw(
            'SELECT * FROM article WHERE id > %s AND id <= %s',
            [min_id, max_id])
        for old_article in articles:
            # populate the new field on this slice of articles, then save
            old_article.newfield = calculate_newfield(old_article)
            old_article.save()
        min_id += 500

If you do not need full access to the objects, you can always use only() or values()/values_list() on your queryset. This should significantly reduce the memory requirements, but I am not sure whether it will be enough.
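For instance, a minimal sketch of that idea applied to the question's migration, assuming calculate_newfield() can work from a plain dict and with 'source_field' standing in for whatever columns it actually needs:

    # Sketch: fetch only the needed columns as dicts, then write each result
    # back with update() so no full model instances are kept in memory.
    rows = orm.Article.objects.values('pk', 'source_field').iterator()
    for row in rows:
        newval = calculate_newfield(row)  # assumes a dict is acceptable input
        orm.Article.objects.filter(pk=row['pk']).update(newfield=newval)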

