To give as much context as possible: I'm trying to pull some data stored on a remote postgres server (Heroku) into a pandas DataFrame, using psycopg2 to connect.
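For reference, the connection is set up more or less like this (the host, database name and credentials below are just placeholders standing in for the actual Heroku values):

import psycopg2

# placeholder connection details; the real values come from the Heroku database credentials
conn = psycopg2.connect(
    host="example.compute-1.amazonaws.com",
    dbname="my_database",
    user="my_user",
    password="my_password",
    port=5432,
)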
I'm interested in two specific tables, users and events, and the connection works fine: when pulling down the user data with
import pandas.io.sql as sql
users = sql.read_sql("SELECT * FROM users", conn)

the DataFrame comes back as expected after waiting a few seconds:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67458 entries, 0 to 67457
Data columns (total 35 columns):
[...]
However, when trying to pull the bigger, heavier events data directly from ipython, after a long time the session simply crashes back to the shell:
In [11]: events = sql.read_sql("SELECT * FROM events", conn)

vagrant@data-science-toolbox:~$
and when trying the same from the iPython Notebook I get the Dead kernel error:

The kernel has died, would you like to restart it? If you do not restart the kernel, you will be able to save the notebook, but running code will not work until the notebook is reopened.
Update #1:
To give a better idea of the size of the events table I'm trying to pull, here are the number of records and the number of attributes (columns):
In [11]: sql.read_sql("SELECT count(*) FROM events", conn)
Out[11]:
     count
0  2711453

In [12]: len(sql.read_sql("SELECT * FROM events LIMIT 1", conn).columns)
Out[12]: 18
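For what it's worth, a rough way to check whether the full table could ever fit in memory would be to extrapolate from a small sample, something like the sketch below (this assumes a pandas version that has DataFrame.memory_usage and that rows are fairly uniform in size):

# pull a small sample and extrapolate its memory footprint to all 2711453 rows
sample = sql.read_sql("SELECT * FROM events LIMIT 1000", conn)
bytes_per_row = sample.memory_usage(deep=True).sum() / float(len(sample))
print("Estimated in-memory size of the full events table: %.0f MB" % (bytes_per_row * 2711453 / 1024 ** 2))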
Update #2:
Memory is definitely a bottleneck for the current read_sql implementation: when pulling down the events and then trying to start another instance of iPython, the result is:
vagrant@data-science-toolbox:~$ sudo ipython
-bash: fork: Cannot allocate memory
Update #3:
I first tried a read_sql_chunked implementation that simply returns a list of partial DataFrames:
import pandas as pd

def read_sql_chunked(query, conn, nrows, chunksize=1000):
    start = 0
    dfs = []  # holds the partial DataFrames
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), conn)
        start += chunksize
        dfs.append(df)
        print "Events added: %s to %s of %s" % (start - chunksize, start, nrows)
    return dfs
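which I then call along these lines, so that event_dfs below holds the list of chunks:

nrows = 2711453  # total number of events, from the count(*) above
event_dfs = read_sql_chunked("SELECT * FROM events", conn, nrows)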
and this works well, but when trying to concatenate the DataFrames, the kernel dies again.
And this is after providing the VM with 2GB of RAM.
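As an aside, I also noticed that recent pandas versions can do the chunking themselves: passing a chunksize argument to read_sql is supposed to return an iterator of partial DataFrames instead of one big frame. A rough, untested sketch (handle is just a placeholder for whatever gets done with each chunk, e.g. writing it out or aggregating it; I also don't know whether psycopg2's default client-side cursor still buffers the whole result set, which might limit the memory savings):

# untested sketch of pandas' built-in chunking
for chunk in pd.read_sql("SELECT * FROM events", conn, chunksize=10000):
    handle(chunk)  # hypothetical placeholder: write out or aggregate each chunk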
Based on Andy's explanation of the read_sql vs. read_csv differences in implementation and performance, the next thing I tried was to append the records to a CSV and then read them all back into a DataFrame:
event_dfs[0].to_csv(path+'new_events.csv', encoding='utf-8')

for df in event_dfs[1:]:
    df.to_csv(path+'new_events.csv', mode='a', header=False, encoding='utf-8')
Again, writing to the CSV completes successfully and produces a 657 MB file, but reading back from the CSV never completes.
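A chunked read of the CSV might avoid the parser's peak memory use, although I assume the final concatenation still needs enough RAM to hold the whole frame; roughly (untested):

# untested sketch: read the CSV back in chunks and concatenate at the end
reader = pd.read_csv(path + 'new_events.csv', chunksize=100000, encoding='utf-8')
events = pd.concat(reader, ignore_index=True)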
Roughly how much RAM would it take to read, say, a 657 MB CSV file into a DataFrame, given that 2 GB does not seem to be enough?
It feels like I'm missing some fundamental understanding of either DataFrames or psycopg2; I'm stuck to the point that I can't even identify the bottleneck or figure out where to optimize.
What is the right strategy for pulling large amounts of data from a remote (postgres) server?