I am doing ETL of log files into a PostgreSQL database and want to learn more about the various approaches used to optimize the performance of loading data into a simple star schema.
To put the question in context, here is an overview of what I am currently doing:
- Drop all foreign key and unique constraints
- Import the data (~100 million records)
- Re-create the constraints and run ANALYZE on the fact table.
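In case it helps, the drop/re-create around the load looks roughly like this (the constraint names and the unique key columns are just placeholders here, not my real schema; the real fact table has one foreign key per dimension):

-- Placeholder constraint names, for illustration only
ALTER TABLE event DROP CONSTRAINT IF EXISTS event_fk_host_fkey;
ALTER TABLE event DROP CONSTRAINT IF EXISTS event_natural_key;

-- ... the per-file import described below runs here ...

ALTER TABLE event ADD CONSTRAINT event_fk_host_fkey
    FOREIGN KEY (fk_host) REFERENCES host (id);
ALTER TABLE event ADD CONSTRAINT event_natural_key UNIQUE (time, fk_host);
ANALYZE event;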
Data is imported by loading from files. For each file:
1) Load the data into a temporary table using COPY (PostgreSQL's bulk upload tool)
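The COPY step itself is along these lines (the column list and file path are illustrative; the real files have more columns and may need different format options):

CREATE TEMP TABLE temp_table (
    time      timestamptz,
    status    text,
    host_name text,
    etype     text
    -- plus the remaining raw log columns
);

-- Server-side file; from a client session, psql's \copy is the equivalent.
COPY temp_table FROM '/path/to/logfile.csv' WITH (FORMAT csv);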
2) Update each of the 9 dimension tables with any new data, using an insert for each of them, such as:
INSERT INTO host (name)
SELECT DISTINCT host_name FROM temp_table
EXCEPT
SELECT name FROM host;
ANALYZE host;
The ANALYZE is run at the end of the INSERT with the idea of keeping the statistics up to date (is this advisable, or is it unnecessary?).
3) The fact table is then updated with a 9-way join:
INSERT INTO event (time, status, fk_host, fk_etype, ... )
SELECT t.time, t.status, host.id, etype.id ...
FROM temp_table as t
JOIN host ON t.host_name = host.name
JOIN etype ON t.etype = etype.name
... and 7 more joins, one for each dimension table
Are there better, or more idiomatic, approaches to doing this?