Using COPY FROM STDIN to load tables while reading the input file only once

I have a large (~60 million line) fixed-width file with ~1800 entries per line.

I need to load this file into 5 different tables in a Postgres 8.3.9 instance.

My dilemma is that since the file is so large, I would like to read it only once.

It's simple enough to do with INSERT or COPY as usual, but I'm trying to speed up the load by wrapping my COPY FROM statements in a transaction that also includes a TRUNCATE - this avoids WAL logging, which should give a significant speed boost (according to http://www.cirrusql.com/node/3). As far as I understand, you can turn off logging entirely in Postgres 9.x, but I don't have that option on 8.3.9.
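In other words, what I'm piping into psql boils down to this pattern, with one TRUNCATE/COPY block per table and the parsed rows in place of the ellipsis:

  BEGIN;
  TRUNCATE TABLE copytest1;
  COPY copytest1 FROM STDIN WITH NULL AS '';
  ...parsed data rows...
  \.
  -- ...the same TRUNCATE / COPY / \. block for each remaining table...
  COMMIT;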

Below is the script I'm using; it reads the input file twice, which is what I want to avoid... any ideas on how I could do this while reading the input file only once? It doesn't have to be bash - I also tried using psycopg2, but couldn't figure out how to pass a file stream to the COPY statement the way I do below (a rough sketch of that route follows the script). I can't COPY straight from the file because I need to parse it on the fly.

 #!/bin/bash
 table1="copytest1"
 table2="copytest2"

 # note: $1 refers to the first argument used when invoking this script,
 # which should be the location of the file one wishes to have python
 # parse and stream out into psql to be copied into the data tables
 (
   echo 'BEGIN;'
   echo 'TRUNCATE TABLE ' ${table1} ';'
   echo 'COPY ' ${table1} ' FROM STDIN'
   echo "WITH NULL AS '';"
   cat $1 | python2.5 ~/parse_${table1}.py
   echo '\.'
   echo 'TRUNCATE TABLE ' ${table2} ';'
   echo 'COPY ' ${table2} ' FROM STDIN'
   echo "WITH NULL AS '';"
   cat $1 | python2.5 ~/parse_${table2}.py
   echo '\.'
   echo 'COMMIT;'
 ) | psql -U postgres -h chewy.somehost.com -p 5473 -d db_name

 exit 0
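For what it's worth, psycopg2's copy_expert() will read from any file-like object, so the psycopg2 route would look roughly like the sketch below; the parse_fields() slicing and connection settings are placeholders, and this only covers a single table:

  import sys
  import psycopg2

  # all names, column slices and connection settings below are placeholders
  def parse_fields(raw_line):
      # hypothetical fixed-width slicing; COPY expects tab-separated columns
      return '\t'.join([raw_line[0:10].strip(), raw_line[10:20].strip()]) + '\n'

  class ParsedFile(object):
      """File-like wrapper so COPY reads parsed lines instead of the raw file."""
      def __init__(self, f):
          self.f = f
      def readline(self, size=-1):
          raw = self.f.readline()
          if not raw:
              return ''          # empty string tells psycopg2 the stream is done
          return parse_fields(raw)
      def read(self, size=-1):
          return self.readline(size)

  conn = psycopg2.connect("dbname=db_name user=postgres host=chewy.somehost.com port=5473")
  cur = conn.cursor()
  cur.execute("TRUNCATE TABLE copytest1")   # same transaction as the COPY below
  cur.copy_expert("COPY copytest1 FROM STDIN WITH NULL AS ''",
                  ParsedFile(open(sys.argv[1])))
  conn.commit()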

Thanks!

+4
2 answers

You could use named pipes instead of your anonymous pipe. With this approach, your python script can feed the relevant data to each table through a separate psql process.

Create the pipes:

 mkfifo fifo_table1
 mkfifo fifo_table2

Run psql instances:

 psql db_name < fifo_table1 &
 psql db_name < fifo_table2 &

Your python script would look something like this (pseudocode):

 SQL_BEGIN = """ BEGIN; TRUNCATE TABLE %s; COPY %s FROM STDIN WITH NULL AS ''; """ fifo1 = open('fifo_table1', 'w') fifo2 = open('fifo_table2', 'w') bigfile = open('mybigfile', 'r') print >> fifo1, SQL_BEGIN % ('table1', 'table1') #ugly, with python2.6 you could use .format()-Syntax print >> fifo2, SQL_BEGIN % ('table2', 'table2') for line in bigfile: # your code, which decides where the data belongs to # if data belongs to table1 print >> fifo1, data # else print >> fifo2, data print >> fifo1, 'COMMIT;' print >> fifo2, 'COMMIT;' fifo1.close() fifo2.close() 

This may not be the most elegant solution, but it should work.
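Once the python script has closed both fifos, the invoking shell should wait for the two background psql processes before cleaning up, so it does not move on until both COMMITs have gone through. A minimal cleanup sketch:

  wait                        # each psql exits after its fifo reaches EOF and the COMMIT runs
  rm fifo_table1 fifo_table2  # remove the named pipes created earlier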

+2

Why use COPY for the second table? I would suggest that doing:

  INSERT INTO table2 (...)
 SELECT ...
 FROM table1;

will be faster than using COPY.

Edit
If you need to import different rows into different tables from the same source file, you can insert everything into a staging table first and then insert the rows from there into the target tables, which will most likely be faster:

Import the whole text file into one staging table:

  COPY staging_table FROM STDIN ...;

After this step, the entire input file is in staging_table.

Then copy the rows from the staging table into the individual target tables, selecting only those that belong to each table:

  INSERT INTO table_1 (...)
 SELECT ...
 FROM staging_table
 WHERE (conditions for table_1);

 INSERT INTO table_2 (...)
 SELECT ...
 FROM staging_table
 WHERE (conditions for table_2);
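Since the source file is fixed width, the staging table can be as simple as a single text column holding each raw line, with substr() doing both the field slicing and the per-table filtering in the statements above. The column names, offsets and record-type flag below are made up for illustration:

  CREATE TABLE staging_table (raw_line text);

  INSERT INTO table_1 (col_a, col_b)
  SELECT substr(raw_line, 1, 10),
         substr(raw_line, 11, 10)
  FROM staging_table
  WHERE substr(raw_line, 21, 1) = '1';   -- hypothetical record-type flag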

This, of course, is only possible if you have enough space in your database to support the staging table.

+2
