SQL - Optimizing the performance of bulk inserts and large joins?

I am doing ETL on log files into a PostgreSQL database and want to learn more about the various approaches used to optimize the performance of loading data into a simple star schema.

To put the question in context, here is an overview of what I am currently doing:

  • Drop all foreign key and unique constraints
  • Import the data (~100 million records)
  • Recreate the constraints and run ANALYZE on the fact table (a rough sketch of this wrapper follows below).
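
A minimal sketch of that outer wrapper, with hypothetical constraint and column names (the real schema will have its own):

ALTER TABLE event DROP CONSTRAINT IF EXISTS event_fk_host_fkey;
-- ... drop the remaining foreign key and unique constraints the same way

-- (the per-file load described below runs here)

ALTER TABLE event
    ADD CONSTRAINT event_fk_host_fkey
    FOREIGN KEY (fk_host) REFERENCES host (id);
-- ... recreate the remaining constraints
ANALYZE event;  -- refresh planner statistics on the fact table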

The data is imported from a set of files. For each file:

1) Load the data into a temporary table using COPY (PostgreSQL's bulk-load command); a sketch is below.
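
A minimal sketch of this step, assuming a tab-separated file and only the raw columns that appear later in the question (the path and column types are placeholders):

CREATE TEMP TABLE temp_table (
    time       timestamptz,
    status     integer,
    host_name  text,
    etype      text
    -- ... one raw column per remaining dimension attribute
);

COPY temp_table (time, status, host_name, etype)
FROM '/path/to/logfile.tsv'   -- placeholder path; use \copy from psql for client-side files
WITH (FORMAT text);           -- text format defaults to tab-delimited columns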

2) Update each of the 9 dimension tables with any new data, using one INSERT per dimension, for example:

INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;

Is this INSERT a reasonable way to pick up only the new dimension values (or is there a faster approach?).

3) Finally, load the fact table with a 9-way join against the dimension tables:

INSERT INTO event (time, status, fk_host, fk_etype, ... ) 
SELECT t.time, t.status, host.id, etype.id ... 
FROM temp_table as t 
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
-- ... and 7 more joins, one for each remaining dimension table

So, is there a better or faster way to do this?

What you are doing looks basically sound. For step 2: Postgres has no built-in "insert if not exists", so the "select distinct ... except select" pattern you are using is a sensible way to add only the new dimension values.

For steps 2 and 3 I would take a different approach, avoiding both the per-dimension EXCEPT inserts and the 9-way join.

I would use stored procedures: an insertXXXFact(...) sproc that calls a set of other sprocs (one per dimension) of the form getOrInsertXXXDim, where XXX is the dimension in question. Each of those sprocs looks up or inserts the dimension row as needed (effectively acting as a cache) and returns its primary key. The fact sproc then simply does insert into XXXFact values (DimPKey1, DimPKey2, ... etc.); see the sketch below.

A further advantage of the getOrInsertXXX sprocs is that the lookup-or-insert logic for each dimension lives in one place, so any other loading code can reuse it.
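
A rough sketch of that pattern in PL/pgSQL for a single dimension (host) and the fact insert; the function names and signatures here are illustrative, not from the original post, and host.id is assumed to be a serial/identity column:

CREATE OR REPLACE FUNCTION get_or_insert_host_dim(p_name text)
RETURNS integer AS $$
DECLARE
    v_id integer;
BEGIN
    SELECT id INTO v_id FROM host WHERE name = p_name;
    IF v_id IS NULL THEN
        -- not seen before: insert it and capture the generated key
        INSERT INTO host (name) VALUES (p_name) RETURNING id INTO v_id;
    END IF;
    RETURN v_id;  -- primary key of the existing or newly inserted row
END;
$$ LANGUAGE plpgsql;

CREATE OR REPLACE FUNCTION insert_event_fact(p_time timestamptz,
                                             p_status integer,
                                             p_host_name text)
RETURNS void AS $$
BEGIN
    -- the real version takes one parameter and one getOrInsert call per dimension
    INSERT INTO event (time, status, fk_host)
    VALUES (p_time, p_status, get_or_insert_host_dim(p_host_name));
END;
$$ LANGUAGE plpgsql;

-- called once per staged row, e.g.:
-- SELECT insert_event_fact(t.time, t.status, t.host_name) FROM temp_table t;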

Source: https://habr.com/ru/post/1714422/

