Using Django ORM to handle a huge number of large records

I have a table containing about 30,000 records that I am trying to iterate over and process with the Django ORM. Each record stores several binary blobs, each of which can be several MB in size, which I need to process and write out to a file.

However, I am running into memory problems doing this with Django. My machine has 8 GB of RAM, but after processing about 5,000 records the Python process has consumed all 8 GB and gets killed by the Linux kernel. I have tried various tricks to clear Django's query cache, for example:

  • periodically calling MyModel.objects.update()
  • setting settings.DEBUG = False
  • periodically calling the Python garbage collector via gc.collect()

However, none of them seems to have any noticeable effect, and the process keeps leaking memory until it is killed.
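In simplified form, the loop looks roughly like the sketch below (MyModel comes from the tricks above; the app name, field names, output path, and batch interval are illustrative placeholders):

    import gc

    from django.conf import settings
    from myapp.models import MyModel  # "myapp" and the field names below are placeholders

    settings.DEBUG = False  # tried: stop Django from logging every executed query

    def process(blob):
        # Placeholder for the real per-blob processing.
        return blob

    with open("output.dat", "wb") as out:
        for i, record in enumerate(MyModel.objects.all()):
            # Each record carries several multi-MB binary blobs.
            out.write(process(record.blob_a))
            out.write(process(record.blob_b))

            if i % 1000 == 0:
                MyModel.objects.update()  # tried: cache-clearing trick
                gc.collect()              # tried: force a garbage-collection pass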

Is there anything else I can do?

Since I only need to process each record once, and I never need to access the same record again later in the process, there is no need to keep any model instance around or to load more than one instance at a time. How can I ensure that only one record is loaded at a time, that Django caches nothing, and that the memory is released as soon as each record has been processed?

1 answer

Try using iterator().

A QuerySet typically caches its results internally so that repeated evaluations do not result in additional queries. In contrast, iterator() will read results directly, without doing any caching at the QuerySet level (internally, the default iterator calls iterator() and caches the return value). For a QuerySet which returns a large number of objects that you only need to access once, this can result in better performance and a significant reduction in memory.

This is a quote from the Django docs: https://docs.djangoproject.com/en/dev/ref/models/querysets/#iterator
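A minimal sketch of applying this to the question's setup (the model and field names are placeholders, and the chunk_size argument is an assumption that requires Django 2.0 or newer):

    from myapp.models import MyModel  # placeholder app/model names

    with open("output.dat", "wb") as out:
        # iterator() streams rows instead of caching the whole result set on
        # the QuerySet; chunk_size controls how many rows are fetched per
        # round trip to the database (argument available since Django 2.0).
        for record in MyModel.objects.iterator(chunk_size=500):
            out.write(record.blob_a)  # blob_a stands in for the real binary field

Two related points worth checking: with DEBUG = True Django keeps a log of every executed query on the connection (django.db.reset_queries() clears it), and on PostgreSQL and Oracle iterator() uses a server-side cursor so rows actually stream from the database, while other backends' drivers may still buffer the full result set client-side, in which case fetching records in primary-key ranges is a common fallback.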

