To give as much context as possible: I'm trying to pull some data stored on a remote postgres server (Heroku) into a pandas DataFrame, using psycopg2 to connect.
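For reference, the connection is set up more or less like this (the host, database name and credentials below are just placeholders standing in for the actual Heroku values):

import psycopg2

# placeholder connection details; the real values come from the Heroku database credentials
conn = psycopg2.connect(
    host="example.compute-1.amazonaws.com",
    dbname="my_database",
    user="my_user",
    password="my_password",
    port=5432,
)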
I'm interested in two specific tables, users and events, and the connection works fine: when pulling down the user data with
import pandas.io.sql as sql
users = sql.read_sql("SELECT * FROM users", conn)

the DataFrame comes back as expected after waiting a few seconds:

<class 'pandas.core.frame.DataFrame'>
Int64Index: 67458 entries, 0 to 67457
Data columns (total 35 columns):
[...]
However, when trying to pull the bigger, heavier events data directly from ipython, after a long time the session simply crashes back to the shell:
In [11]: events = sql.read_sql("SELECT * FROM events", conn)

vagrant@data-science-toolbox:~$
and when trying the same from the iPython Notebook I get the Dead kernel error:

The kernel has died, would you like to restart it? If you do not restart the kernel, you will be able to save the notebook, but running code will not work until the notebook is reopened.
Update #1:
To give a better idea of the size of the events table I'm trying to pull, here are the number of records and the number of attributes (columns):
In [11]: sql.read_sql("SELECT count(*) FROM events", conn)
Out[11]:
     count
0  2711453

In [12]: len(sql.read_sql("SELECT * FROM events LIMIT 1", conn).columns)
Out[12]: 18
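For what it's worth, a rough way to check whether the full table could ever fit in memory would be to extrapolate from a small sample, something like the sketch below (this assumes a pandas version that has DataFrame.memory_usage and that rows are fairly uniform in size):

# pull a small sample and extrapolate its memory footprint to all 2711453 rows
sample = sql.read_sql("SELECT * FROM events LIMIT 1000", conn)
bytes_per_row = sample.memory_usage(deep=True).sum() / float(len(sample))
print("Estimated in-memory size of the full events table: %.0f MB" % (bytes_per_row * 2711453 / 1024 ** 2))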
Update #2:
Memory is definitely a bottleneck for the current read_sql implementation: when pulling down the events and then trying to start another instance of iPython, the result is:
vagrant@data-science-toolbox:~$ sudo ipython
-bash: fork: Cannot allocate memory
Update #3:
I first tried a read_sql_chunked implementation that simply returns a list of partial DataFrames:
import pandas as pd

def read_sql_chunked(query, conn, nrows, chunksize=1000):
    start = 0
    dfs = []  # holds the partial DataFrames
    while start < nrows:
        df = pd.read_sql("%s LIMIT %s OFFSET %s" % (query, chunksize, start), conn)
        start += chunksize
        dfs.append(df)
        print "Events added: %s to %s of %s" % (start - chunksize, start, nrows)
    return dfs
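which I then call along these lines, so that event_dfs below holds the list of chunks:

nrows = 2711453  # total number of events, from the count(*) above
event_dfs = read_sql_chunked("SELECT * FROM events", conn, nrows)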
and this works well, but when trying to concatenate the DataFrames, the kernel dies again.
And this is after providing the VM with 2GB of RAM.
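As an aside, I also noticed that recent pandas versions can do the chunking themselves: passing a chunksize argument to read_sql is supposed to return an iterator of partial DataFrames instead of one big frame. A rough, untested sketch (handle is just a placeholder for whatever gets done with each chunk, e.g. writing it out or aggregating it; I also don't know whether psycopg2's default client-side cursor still buffers the whole result set, which might limit the memory savings):

# untested sketch of pandas' built-in chunking
for chunk in pd.read_sql("SELECT * FROM events", conn, chunksize=10000):
    handle(chunk)  # hypothetical placeholder: write out or aggregate each chunk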
Based on Andy's explanation of the read_sql vs. read_csv differences in implementation and performance, the next thing I tried was to append the records to a CSV and then read them all back into a DataFrame:
event_dfs[0].to_csv(path+'new_events.csv', encoding='utf-8')

for df in event_dfs[1:]:
    df.to_csv(path+'new_events.csv', mode='a', header=False, encoding='utf-8')
Again, writing to the CSV completes successfully and produces a 657 MB file, but reading back from the CSV never completes.
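A chunked read of the CSV might avoid the parser's peak memory use, although I assume the final concatenation still needs enough RAM to hold the whole frame; roughly (untested):

# untested sketch: read the CSV back in chunks and concatenate at the end
reader = pd.read_csv(path + 'new_events.csv', chunksize=100000, encoding='utf-8')
events = pd.concat(reader, ignore_index=True)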
Roughly how much RAM would it take to read, say, a 657 MB CSV file into a DataFrame, given that 2 GB does not seem to be enough?
It feels like I'm missing some fundamental understanding of either DataFrames or psycopg2; I'm stuck to the point that I can't even identify the bottleneck or figure out where to optimize.
What is the right strategy for pulling large amounts of data from a remote (postgres) server?