Context
I have a Python function that scores a row in my table. I would like to combine the scores of all the rows arithmetically (for example, compute the sum, the average, etc.).
    def score(row):
        # returns a numeric score for a single row
        ...
The obvious first approach is simply to read in all the data:
    import psycopg2
    from psycopg2 import sql

    def sum_scores(dbname, tablename):
        conn = psycopg2.connect(dbname=dbname)
        cur = conn.cursor()
        # a table name is an identifier, not a value, so it cannot be
        # passed as a query parameter; build it with psycopg2.sql instead
        cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
        rows = cur.fetchall()
        total = 0
        for row in rows:
            total += score(row)
        conn.close()
        return total
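(Note: psycopg2 uses %s placeholders rather than ?, and placeholders can only stand in for values, never for identifiers such as table names, which is why the query above is assembled with sql.Identifier.)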
Problem
I would like to be able to process as much data as my database can store. It may be more than fits into Python's memory, so it seems to me that fetchall() will not work correctly in this case.
Proposed Solutions
I considered three approaches, each with the goal of holding only a bounded number of records in Python's memory at a time:
1. Processing records one by one with fetchone()
    def sum_scores(dbname, tablename):
        ...
        total = 0
        for _ in range(cur.rowcount):
            row = cur.fetchone()
            total += score(row)
        ...
        return total
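A variant that avoids cur.rowcount entirely (just a sketch; it relies on fetchone() returning None once the result set is exhausted, as specified by DB-API 2.0):

    for row in iter(cur.fetchone, None):
        total += score(row)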
2. Processing records in batches with fetchmany(n)
    def sum_scores(dbname, tablename):
        ...
        total = 0
        batch_size = 1000  # 1e3 is a float; fetchmany() needs an int
        while True:
            rows = cur.fetchmany(batch_size)
            if not rows:
                break
            for row in rows:
                total += score(row)
        ...
        return total
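(If fetchmany() is called with no argument, it fetches cursor.arraysize rows per call, which defaults to 1, so passing an explicit batch_size is deliberate here.)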
3. Relying on the cursor as an iterator
    def sum_scores(dbname, tablename):
        ...
        total = 0
        for row in cur:
            total += score(row)
        ...
        return total
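While researching this I also came across psycopg2's server-side ("named") cursors, which reportedly keep the result set on the database server and stream it to the client in chunks. A minimal sketch of what I believe that looks like (the cursor name 'score_cursor' is arbitrary, and itersize controls how many rows each round trip fetches):

    def sum_scores(dbname, tablename):
        conn = psycopg2.connect(dbname=dbname)
        # giving the cursor a name makes it server-side
        cur = conn.cursor(name='score_cursor')
        cur.itersize = 1000  # rows fetched per network round trip
        cur.execute(sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename)))
        total = 0
        for row in cur:
            total += score(row)
        conn.close()
        return total

I am not sure whether this changes the picture for the questions below.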
Questions
Am I correct that my three proposed solutions only pull a controlled amount of data into memory at a time? Or do they suffer from the same problem as fetchall()?
Which of the three proposed solutions will actually work (i.e., compute the correct combined score rather than fail partway through) for LARGE data sets?
How does the cursor iterator (proposed solution #3) actually pull data into Python's memory: one row at a time, in batches, or all at once?