Processing each row of a large database table in Python

Context

I have a function in Python that scores a row in my table. I would like to combine the scores of all the rows arithmetically (for example, compute the sum, average, etc.).

    def compute_score(row):
        # some complicated python code that would be painful
        # to convert into a SQL equivalent
        return score

The obvious first approach is to just read in all the data:

    import psycopg2

    def sum_scores(dbname, tablename):
        conn = psycopg2.connect(dbname=dbname)
        cur = conn.cursor()
        # table names cannot be passed as query parameters in psycopg2
        cur.execute('SELECT * FROM ' + tablename)
        rows = cur.fetchall()  # pulls the entire result set into memory
        total = 0
        for row in rows:
            total += compute_score(row)
        conn.close()
        return total
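Since psycopg2 cannot substitute identifiers such as table names through query parameters, a safer way to build the statement is to compose it with psycopg2.sql. This is only a sketch, assuming psycopg2 2.7 or later (which ships the psycopg2.sql module) and the same cur and tablename variables as above:

    from psycopg2 import sql

    # Quote the table name as an SQL identifier instead of
    # splicing it into the query string by hand.
    query = sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename))
    cur.execute(query)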

Problem

I would like to be able to process as much data as my database can hold. That may be more than fits into Python's memory, so it seems to me that fetchall() will break down in that case.

Suggested Solutions

I considered three approaches, all with the goal of processing only a manageable number of records at a time:

  • Processing records one by one using fetchone()

      def sum_scores(dbname, tablename):
          ...
          total = 0
          for row_num in range(cur.rowcount):
              row = cur.fetchone()
              total += compute_score(row)
          ...
          return total
  • Batch processing using fetchmany(n)

      def sum_scores(dbname, tablename):
          ...
          batch_size = 1000  # tunable
          total = 0
          batch = cur.fetchmany(batch_size)
          while batch:
              for row in batch:
                  total += compute_score(row)
              batch = cur.fetchmany(batch_size)
          ...
          return total
  • Relying on the cursor iterator

      def sum_scores(dbname, tablename):
          ...
          total = 0
          for row in cur:
              total += compute_score(row)
          ...
          return total

Questions

  • Am I right that my three proposed solutions would only pull a manageable amount of data into memory at a time? Or do they suffer from the same problem as fetchall() ?

  • Which of the three proposed solutions will actually work (i.e., compute the correct combination of scores without failing along the way) for LARGE data sets?

  • How does the cursor iterator (proposed solution #3) actually bring data into Python memory: one row at a time, in batches, or all at once?

1 Answer

All three solutions will work and will only bring a subset of the results into memory.

Iterating over the cursor (proposed solution #3) will work the same way as proposed solution #2 if you pass a name to the cursor, which makes psycopg2 use a server-side cursor. Iterating over it then fetches itersize records at a time (the default is 2000).

Solutions #2 and #3 will be much faster than #1, since they incur far less per-call overhead.
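As an illustration, here is a minimal sketch of proposed solution #3 with a named (server-side) cursor, so rows stream to the client in itersize-sized batches rather than all at once. The cursor name score_cursor is arbitrary, and the query is built by plain concatenation for brevity (see the psycopg2.sql note in the question):

    import psycopg2

    def sum_scores(dbname, tablename):
        conn = psycopg2.connect(dbname=dbname)
        # A named cursor makes psycopg2 use a server-side cursor, so rows
        # are fetched in batches of cur.itersize (default 2000).
        cur = conn.cursor(name='score_cursor')
        cur.itersize = 2000
        cur.execute('SELECT * FROM ' + tablename)
        total = 0
        for row in cur:
            total += compute_score(row)
        conn.close()
        return total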

http://initd.org/psycopg/docs/cursor.html#fetch

