Quickly insert pandas DataFrame into Postgres DB using psycopg2

I am trying to insert a pandas DataFrame into a PostgreSQL database (9.1) in the most efficient way possible (using Python 2.7).
Using cursor.executemany is very slow, so I resorted to writing the DataFrame to a buffer with DataFrame.to_csv(buffer, ...) and loading it with copy_from.
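That to_csv/copy_from route can be sketched roughly as follows (a minimal sketch: the table name my_table and the connection are assumptions, and the actual copy call is commented out so the snippet runs without a database):

```python
import io
import pandas as pd

def df_to_buffer(df):
    """Serialize a DataFrame to an in-memory tab-separated buffer for COPY."""
    buf = io.StringIO()
    df.to_csv(buf, sep='\t', header=False, index=False)
    buf.seek(0)  # rewind so copy_from reads from the start
    return buf

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
buf = df_to_buffer(df)
# With an open psycopg2 connection `con`, the buffer could then be loaded
# in a single round trip:
# curs = con.cursor()
# curs.copy_from(buf, 'my_table', columns=('a', 'b'))
# con.commit()
```

COPY's default text format expects tab-separated columns, which is why sep='\t' is used above.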
I have already found a much faster solution on the web ( http://eatthedots.blogspot.de/2008/08/faking-read-support-for-psycopgs.html ), which I adapted to work with pandas.
My code can be found below. My question is whether the method from this related question (using "copy from stdin with binary") can be ported to work with DataFrames, and whether it would be much faster:
Use COPY FROM binary table with psycopg2
Unfortunately, my Python skills are not enough to understand the implementation of this approach.
This is my approach:

    import psycopg2
    import connectDB  # this is simply a module that returns a connection to the db
    from datetime import datetime


    class ReadFaker:
        """
        This could be extended to include the index column optionally.
        Right now the index is not inserted.
        """
        def __init__(self, data):
            self.iter = data.itertuples()

        def readline(self, size=None):
            try:
                line = self.iter.next()[1:]  # element 0 is the index
                # in my case all strings in line are unicode objects
                row = '\t'.join(x.encode('utf8') if isinstance(x, unicode) else str(x)
                                for x in line) + '\n'
            except StopIteration:
                return ''
            else:
                return row

        read = readline


    def insert(df, table, con=None, columns=None):
        time1 = datetime.now()
        close_con = False
        if not con:
            try:
                con = connectDB.getCon()  ### dbLoader returns a connection with my settings
                close_con = True
            except psycopg2.Error, e:
                print e.pgerror
                print e.pgcode
                return "failed"

        inserted_rows = df.shape[0]
        data = ReadFaker(df)

        try:
            curs = con.cursor()
            print 'inserting %s entries into %s ...' % (inserted_rows, table)
            if columns is not None:
                curs.copy_from(data, table, null='nan', columns=[col for col in columns])
            else:
                curs.copy_from(data, table, null='nan')
            con.commit()
            curs.close()
            if close_con:
                con.close()
        except psycopg2.Error, e:
            print e.pgerror
            print e.pgcode
            con.rollback()
            if close_con:
                con.close()
            return "failed"

        time2 = datetime.now()
        print time2 - time1
        return inserted_rows
2 answers

There is now a .to_sql method on pandas DataFrames. PostgreSQL is not yet supported, but there is a patch for it that looks like it is working. See here and here .
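For what it's worth, once that support exists the call should look roughly like this (a sketch assuming a SQLAlchemy-backed pandas; illustrated against an in-memory SQLite engine so it runs anywhere, with the table name my_table made up, and a postgresql:// URL swapped in for a real Postgres database):

```python
import pandas as pd
from sqlalchemy import create_engine

# In-memory stand-in; for Postgres this would be e.g.
# create_engine('postgresql://user:password@localhost/mydb')
engine = create_engine('sqlite://')

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
df.to_sql('my_table', engine, if_exists='replace', index=False)

# Read it back to confirm the round trip
roundtrip = pd.read_sql('SELECT * FROM my_table', engine)
```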


I have not tested the performance, but maybe you can use something like this:

  • Iterate over the rows of the DataFrame, yielding a string representing each row (see below)
  • Convert this iterable to a stream, for example as in Python: convert iterable to stream?
  • Finally, use psycopg's copy_from on that stream.
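The steps above can be sketched as a small file-like adapter over an iterator (a hypothetical sketch; the class name IteratorFile is made up here, and with an open connection the resulting object would be passed straight to copy_from):

```python
import io

class IteratorFile(io.TextIOBase):
    """Wrap an iterator of text lines so it can be handed to copy_from."""
    def __init__(self, lines):
        self._lines = iter(lines)
        self._buf = ''

    def readable(self):
        return True

    def read(self, size=-1):
        # Pull lines from the iterator until `size` characters are buffered
        # (or the iterator is exhausted; size < 0 means read everything).
        while size < 0 or len(self._buf) < size:
            try:
                self._buf += next(self._lines)
            except StopIteration:
                break
        if size < 0:
            size = len(self._buf)
        data, self._buf = self._buf[:size], self._buf[size:]
        return data

rows = ('%d\t%s\n' % (i, s) for i, s in [(1, 'x'), (2, 'y')])
stream = IteratorFile(rows)
# With an open connection: curs.copy_from(stream, 'my_table')
```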

To yield the rows of a DataFrame efficiently, use something like:

    def r(df):
        for idx, row in df.iterrows():
            yield ','.join(map(str, row))
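On a toy DataFrame (assumed here purely for illustration), the generator yields comma-separated strings that could then be wrapped into a stream for copy_from:

```python
import pandas as pd

def r(df):
    for idx, row in df.iterrows():
        yield ','.join(map(str, row))

df = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
rows = list(r(df))
# copy_from defaults to tab-separated input, so comma-separated rows
# would need sep=',' passed to copy_from.
```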

Source: https://habr.com/ru/post/1402929/

