Amazon Redshift: how to COPY from S3 and set job_id

Amazon Redshift provides the ability to load table data from S3 objects using the COPY command. What I would like is to use COPY, but also set an extra "col = CONSTANT" on each row inserted.

I want to set job_id (which is not in the source data) on each copied row, and it would be a shame to run several million individual inserts just so each row has a job attribute, when COPY gets me 99% of the way there with much better performance.

Is there a smarter solution?

2 answers

If you want all rows added in the same COPY command to have the same job_id value, you can COPY the data into a staging table, then add the job_id column to that table, and finally insert all the data from the staging table into the final table, like:

CREATE TABLE destination_staging (LIKE destination);
ALTER TABLE destination_staging DROP COLUMN job_id;
COPY destination_staging FROM 's3://data/destination/(...)' (...);
ALTER TABLE destination_staging ADD COLUMN job_id INT DEFAULT 42;
INSERT INTO destination SELECT * FROM destination_staging ORDER BY sortkey_column;
DROP TABLE destination_staging;
ANALYZE destination;
VACUUM destination;

ANALYZE and VACUUM are not strictly required, but they are highly recommended: ANALYZE refreshes the statistics used by the query planner, and VACUUM re-sorts the newly inserted rows into their correct positions.
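
For illustration, here is a minimal Python sketch of the same staging sequence driven through psycopg2, with the job_id passed as a query parameter instead of hard-coded into an ALTER. The connection details, S3 path, and IAM role ARN are hypothetical placeholders, and it assumes job_id is the last column of destination:

import psycopg2

JOB_ID = 42  # the per-load constant to stamp on every row

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439, dbname="mydb", user="loader", password="...",
)
with conn, conn.cursor() as cur:
    # A temp table disappears at session end, so no explicit DROP is needed.
    cur.execute("CREATE TEMP TABLE destination_staging (LIKE destination);")
    cur.execute("ALTER TABLE destination_staging DROP COLUMN job_id;")
    cur.execute(
        "COPY destination_staging FROM 's3://data/destination/part.csv' "
        "IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-loader' CSV;"
    )
    # Stamp the constant during the INSERT ... SELECT instead of via
    # ALTER ... ADD COLUMN; assumes job_id is the last column of destination.
    cur.execute(
        "INSERT INTO destination SELECT s.*, %s FROM destination_staging s;",
        (JOB_ID,),
    )
conn.close()

Stamping the constant in the INSERT ... SELECT saves the ALTER step; the ALTER ... ADD COLUMN ... DEFAULT approach shown above works just as well, since Redshift fills the new column with the default for existing rows.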


There seems to be no way to perform pre- or post-processing within the COPY command itself. So the best option is to pre-process the files you are going to COPY into Redshift: add the job_id to each row, then upload the files to S3 and load them.
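
For example, a minimal Python sketch of that preprocessing step, assuming the source files are CSV; the file names, bucket, and job_id value are hypothetical:

import csv
import boto3

JOB_ID = "42"  # hypothetical job id for this load

# Append job_id as an extra column on every row of the source file.
with open("input.csv", newline="") as src, \
     open("with_job_id.csv", "w", newline="") as dst:
    writer = csv.writer(dst)
    for row in csv.reader(src):
        writer.writerow(row + [JOB_ID])

# Upload the enriched file to S3 so COPY can load it.
boto3.client("s3").upload_file(
    "with_job_id.csv", "data", "destination/with_job_id.csv"
)

After the upload, a single COPY can load the files straight into the final table, job_id included, with no staging table needed.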


Source: https://habr.com/ru/post/1480509/

