Psycopg2 equivalent of R's dbWriteTable command, and getting more performance out of Python code

I benchmarked the time it takes to load a data array from csv, create a database in Postgres and write the tables to it, in both Python and R.

I was surprised that the times were very similar:

Python code first:

import timeit
tic = timeit.default_timer()

tic4 = timeit.default_timer()
import xlrd as xl
import psycopg2 as pq
import os
import pandas as pd
import numpy as np
import csv
from pprint import pprint as pp
perf_dir = '/myhomedir'
toc4 = timeit.default_timer()

# Create the database
tic1 = timeit.default_timer()
os.system('dropdb ptest')
os.system('createdb ptest')
# connect to the database
cn = pq.connect('dbname=ptest user=me')
cr = cn.cursor()
toc1 = timeit.default_timer()

# Create the tables:
### load csvs
tic2 = timeit.default_timer()
id_files = ('di1', 'di2', 'di.c')
id_files = [i + r'.csv' for i in id_files]
id1 = csv.reader(open(os.path.join(perf_dir, id_files[0])), delimiter='\t')
id1 = [i for i in id1]
id2 = csv.reader(open(os.path.join(perf_dir, id_files[1])))
id2 = [i for i in id2]
id3 = csv.reader(open(os.path.join(perf_dir, id_files[2])), delimiter='\t')
id3 = [i for i in id3]
id3 = [i[1:3] for i in id3]
toc2 = timeit.default_timer()

# create tables and fill
### id1 fund classifications
tic3 = timeit.default_timer()
cr.execute('CREATE TABLE id1 (%s varchar, %s int PRIMARY KEY, %s int, %s int, %s varchar)' % tuple(id1[0]))
FLDS = 'INSERT INTO id1 (%s,%s,%s,%s,%s) VALUES' % tuple(id1[0])
SQL = FLDS + ' (%s,%s,%s,%s,%s)'
for i in range(1, len(id1)):
    data = tuple(id1[i])
    cr.execute(SQL, data)

### id2 portfolio group classifications - reference only
cr.execute('CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)' % tuple(id2[0]))
SQL = 'INSERT INTO id2 (%s,%s,%s) VALUES' % tuple(id2[0]) + ' (%s,%s,%s)'
for i in range(1, len(id2)):
    data = tuple(id2[i])
    cr.execute(SQL, data)

### id3 value variable classifications
cr.execute('CREATE TABLE id3 (%s varchar,%s varchar)' % tuple(id3[0]))
SQL = 'INSERT INTO id3 VALUES(%s,%s)'
for i in range(1, len(id3)):
    data = tuple(id3[i])
    cr.execute(SQL, data)

cn.commit()

# Timing block - will be commented out in final code
toc3 = timeit.default_timer()
toc = timeit.default_timer()
time = (toc - tic)
time1 = toc1 - tic1
time2 = toc2 - tic2
time3 = toc3 - tic3
time4 = toc4 - tic4
print('Overall time: %s' % time)
print('dB create & connect time: %s' % time1)
print('Load id csvs time: %s' % time2)
print('Create tables and write to db time: %s' % time3)
print('Time to import libraries: %s' % time4)

and the R code (which is much more readable):

tic = proc.time()
library(RPostgreSQL)

tic1 = proc.time()
system('dropdb ptest1')
system('createdb ptest1')
drv = dbDriver("PostgreSQL")
con = dbConnect(drv, dbname='ptest1')
toc1 = proc.time()
time1 = toc1 - tic1

tic2 = proc.time()
id.1 = read.csv('/myhomedir/di1.csv', stringsAsFactors=F, sep='\t')
id.2 = read.csv('/myhomedir/di2.csv', stringsAsFactors=F)
id.3 = read.csv('/myhomedir/di.c.csv', stringsAsFactors=F, sep='\t')
id.3 = id.3[,-1]
toc2 = proc.time()
time2 = toc2 - tic2

tic3 = proc.time()
dbWriteTable(con, 'id1', id.1)
dbWriteTable(con, 'id2', id.2)
dbWriteTable(con, 'id3', id.3)
toc3 = proc.time()
time3 = toc3 - tic3

toc = proc.time()
time = toc - tic

tyme = rbind(time1, time2, time3, time)
tyme = data.frame(Function=c('Create & Connect to DB', "Load CSV for save", "Write Table to DB", 'Overall Time'), tyme)

I was very surprised at how close the times are for the two (I had read plenty of horror stories about R being slow where Python was very fast).

For Python:

Overall time: 2.48381304741
dB create & connect time: 1.96832108498
Load id csvs time: 0.000378847122192
Create tables and write to db time: 0.35303401947
Time to import libraries: 0.162075042725

and for R:

                     Function user.self sys.self elapsed user.child sys.child
time1  Create & Connect to DB     0.112    0.016   1.943       0.06     0.004
time2       Load CSV for save     0.008    0.000   0.006       0.00     0.000
time3       Write Table to DB     0.096    0.004   0.349       0.00     0.000
time             Overall Time     0.376    0.028   2.463       0.06     0.004

I was wondering whether this is down to the fact that I build and issue the INSERT statement row by row in the Python table-creation code.

So the main question is: is there an equivalent in Python of the dbWriteTable block in the R code, and would it speed things up?
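
To make the question concrete, the kind of one-call bulk write I have in mind would look roughly like pandas' to_sql, if that is the right tool (sketch only; I believe it needs an SQLAlchemy engine rather than the raw psycopg2 connection, and the connection URL below is just illustrative):

# Sketch: a dbWriteTable-style one-call write via pandas + SQLAlchemy (assumed installed).
from sqlalchemy import create_engine
import pandas as pd

engine = create_engine('postgresql+psycopg2://me@localhost/ptest')  # illustrative URL
df = pd.read_csv('/myhomedir/di1.csv', sep='\t')
df.to_sql('id1', engine, if_exists='replace', index=False)  # one call writes the whole data frame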

The secondary question is whether there is anything obviously wrong in the code that could be slowing things down.

Happy to provide a csv sample if this helps.

I am not trying to start a flame war over R vs Python; I would just like to know how to make my code faster.

thanks

1 answer

Try the following:

id2 = csv.reader(open(os.path.join(perf_dir, id_files[1])))
h = tuple(next(id2))
create = '''CREATE TABLE id2 (%s varchar, %s int PRIMARY KEY, %s int)''' % h
insert = '''INSERT INTO id2 (%s, %s, %s) VALUES (%%s, %%s, %%s)''' % h
...
cr.executemany(insert, id2)

Send the SQL command only once for all the rows. Apart from the overhead of issuing extra SQL commands, note that the size of the SQL string itself (in bytes) probably dwarfs the size of the actual data being tested.

Leaving id2 as a generator instead of expanding it into a list should also avoid loading the entire data set into memory.
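
If the goal is dbWriteTable-like bulk-load speed, another route worth mentioning is PostgreSQL's COPY, which psycopg2 exposes as cursor.copy_from. A minimal sketch, assuming the id1 table has already been created as in the question and the csv is tab-delimited with a single header row:

# Sketch: bulk-load a tab-delimited csv with COPY instead of per-row INSERTs.
with open(os.path.join(perf_dir, id_files[0])) as f:
    next(f)                           # skip the header row
    cr.copy_from(f, 'id1', sep='\t')  # stream the file straight into the table
cn.commit()

This avoids building INSERT strings altogether and is, as far as I know, roughly what dbWriteTable relies on under the hood.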

