Why Redshift COPY queries use (significantly) more disk space for tables with a sort key

I have a large data set on S3 in the form of several hundred CSV files, totaling ~1.7 TB (uncompressed). I am trying to COPY it into an empty table on a Redshift cluster.

The cluster is empty (there are no other tables) and has 10 dw2.large nodes. If I set a sort key on the table, the COPY command consumes all available disk space when it is about 25% of the way through, and aborts. Without a sort key, the COPY succeeds and never uses more than 45% of the disk. This behavior is consistent whether or not I also set a distribution key.

I don't understand why this happens, or whether it is expected. Has anyone seen this behavior? If so, do you have any suggestions for working around it? One idea would be to import each file individually, but I would rather find a way to let Redshift handle this itself and do it all in one command.

+5
2 answers

Got a response on this from the Redshift team. The cluster requires free space of at least 2.5x the incoming data size, which is used as temporary space for the sort. You can resize the cluster up, copy the data in, and then resize it back down.
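A quick back-of-the-envelope sketch of what that rule implies for this data set (a sketch only: the 0.16 TB/node figure comes from the second answer below, and the exact way Redshift accounts for temporary sort space may differ):

```python
import math

# Figures from the question and answers on this page.
RAW_DATA_TB = 1.7        # uncompressed CSV data on S3
NODE_DISK_TB = 0.16      # disk per dw2.large node (per the second answer)
FREE_SPACE_FACTOR = 2.5  # temporary sort space rule, per the Redshift team

# Free space needed for the sort, relative to the incoming data size.
required_tb = RAW_DATA_TB * FREE_SPACE_FACTOR

# Smallest dw2.large cluster that provides at least that much disk.
nodes_needed = math.ceil(required_tb / NODE_DISK_TB)

print(f"required free space: {required_tb:.2f} TB")  # 4.25 TB
print(f"dw2.large nodes needed: {nodes_needed}")     # 27
```

So under these assumptions a sorted COPY of this data set wants roughly 27 dw2.large nodes of headroom, far more than the 10-node cluster in the question, which matches the observed abort.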

+7

Each dw2.large node has 0.16 TB of disk space. Since you said you have a 10-node cluster, the total capacity is about 1.6 TB. You mentioned that you have about 1.7 TB of raw (uncompressed) data to load into Redshift.
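Written out, the capacity math in this answer looks like the following (figures taken from this answer and the question; actual compression ratios depend heavily on the data and the encodings Redshift chooses):

```python
NODE_DISK_TB = 0.16  # disk per dw2.large node
NUM_NODES = 10
RAW_DATA_TB = 1.7    # uncompressed CSV size from the question

capacity_tb = NODE_DISK_TB * NUM_NODES  # total cluster disk: 1.60 TB

# Minimum compression ratio needed just to fit the data at all,
# before accounting for any temporary sort space.
min_ratio = RAW_DATA_TB / capacity_tb

print(f"cluster capacity: {capacity_tb:.2f} TB")            # 1.60 TB
print(f"compression needed just to fit: {min_ratio:.2f}x")  # 1.06x
```

The raw data barely exceeds the raw cluster capacity, so compression is what makes the unsorted load fit at all; the sorted load then fails because of the extra temporary space, not the final table size.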

When you load data into Redshift with the COPY command, Redshift automatically analyzes and applies compression encodings as it loads. Once the table is loaded, you can inspect the encodings with a query:

    select "column", type, encoding
    from pg_table_def
    where tablename = 'my_table_name';

First load your data into the table without a sort key and see what compression is applied. I suggest you drop and recreate the table for each test load, so that compression encodings are analyzed fresh each time. After loading the table with COPY, see the link below and run the script there to determine the table size:

http://docs.aws.amazon.com/redshift/latest/dg/c_analyzing-table-design.html

When you define a sort key on the table and load data, the sort key itself also takes up some disk space. That is why a table without a sort key requires less disk space than the same table with one.

Either way, you need to make sure that compression is actually being applied to the table.

A sort key simply needs more storage. When you use one, also check whether you can load the data already in sort-key order, so that it is stored sorted on disk and you avoid having to run VACUUM to re-sort the table after loading.

0

Source: https://habr.com/ru/post/1204531/

