Processing each row of a large database table in Python

Context

I have a function in Python that scores a row in my table. I would like to combine the scores of all the rows arithmetically (for example, compute the sum, average, etc.).

    def compute_score(row):
        # some complicated python code that would be painful
        # to convert into a SQL equivalent
        return score

The obvious first approach is to just read in all the data:

    import psycopg2

    def sum_scores(dbname, tablename):
        conn = psycopg2.connect(dbname=dbname)
        cur = conn.cursor()
        # table names cannot be passed as query parameters in psycopg2
        cur.execute('SELECT * FROM ' + tablename)
        rows = cur.fetchall()  # pulls the entire result set into memory
        total = 0
        for row in rows:
            total += compute_score(row)
        conn.close()
        return total
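Since psycopg2 cannot substitute identifiers such as table names through query parameters, a safer way to build the statement is to compose it with psycopg2.sql. This is only a sketch, assuming psycopg2 2.7 or later (which ships the psycopg2.sql module) and the same cur and tablename variables as above:

    from psycopg2 import sql

    # Quote the table name as an SQL identifier instead of
    # splicing it into the query string by hand.
    query = sql.SQL('SELECT * FROM {}').format(sql.Identifier(tablename))
    cur.execute(query)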

Problem

I would like to be able to process as much data as my database can hold. That may be more than fits into Python's memory, so it seems to me that fetchall() will break down in that case.

Suggested Solutions

I considered three approaches, all with the goal of processing only a manageable number of records at a time:

  • Processing records one by one using fetchone()

      def sum_scores(dbname, tablename):
          ...
          total = 0
          for row_num in range(cur.rowcount):
              row = cur.fetchone()
              total += compute_score(row)
          ...
          return total
  • Batch processing using fetchmany(n)

      def sum_scores(dbname, tablename):
          ...
          batch_size = 1000  # tunable
          total = 0
          batch = cur.fetchmany(batch_size)
          while batch:
              for row in batch:
                  total += compute_score(row)
              batch = cur.fetchmany(batch_size)
          ...
          return total
  • Relying on the cursor iterator

      def sum_scores(dbname, tablename):
          ...
          total = 0
          for row in cur:
              total += compute_score(row)
          ...
          return total

Questions

  • Am I right that my three proposed solutions would only pull a manageable amount of data into memory at a time? Or do they suffer from the same problem as fetchall() ?

  • Which of the three proposed solutions will actually work (i.e., compute the correct combination of scores without failing along the way) for LARGE data sets?

  • How does the cursor iterator (proposed solution #3) actually bring data into Python memory: one row at a time, in batches, or all at once?

1 Answer

All three solutions will work and will only bring a subset of the results into memory.

Iterating over the cursor (proposed solution #3) will work the same way as proposed solution #2 if you pass a name to the cursor, which makes psycopg2 use a server-side cursor. Iterating over it then fetches itersize records at a time (the default is 2000).

Solutions #2 and #3 will be much faster than #1, since they incur far less per-call overhead.
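As an illustration, here is a minimal sketch of proposed solution #3 with a named (server-side) cursor, so rows stream to the client in itersize-sized batches rather than all at once. The cursor name score_cursor is arbitrary, and the query is built by plain concatenation for brevity (see the psycopg2.sql note in the question):

    import psycopg2

    def sum_scores(dbname, tablename):
        conn = psycopg2.connect(dbname=dbname)
        # A named cursor makes psycopg2 use a server-side cursor, so rows
        # are fetched in batches of cur.itersize (default 2000).
        cur = conn.cursor(name='score_cursor')
        cur.itersize = 2000
        cur.execute('SELECT * FROM ' + tablename)
        total = 0
        for row in cur:
            total += compute_score(row)
        conn.close()
        return total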

http://initd.org/psycopg/docs/cursor.html#fetch

