How to do real-time loading into Amazon Redshift?

We are evaluating Amazon Redshift for real-time data warehousing.

Data will be streamed and processed through a Java service and must be stored in a database. We process the data row by row (in real time) and will insert only one row per transaction.

What is the best way to load real-time data into Amazon Redshift?

Should we use JDBC and execute INSERT INTO statements, or should we try Kinesis Firehose, or possibly AWS Lambda?

I am interested in these services because both use Amazon S3 as a middle layer and execute the COPY command, which is suited to large datasets rather than single-row inserts.
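For context, here is a minimal sketch of the single-row JDBC INSERT path described above. The table name `events`, its columns, and the connection setup (not shown) are hypothetical:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Timestamp;

public class SingleRowInsert {
    // Parameterized single-row statement; table and column names are assumed.
    static final String INSERT_SQL =
        "INSERT INTO events (event_id, payload, received_at) VALUES (?, ?, ?)";

    // One network round trip and one transaction per row -- the pattern
    // the question is asking about.
    static void insertRow(Connection conn, String id, String payload, Timestamp ts)
            throws SQLException {
        try (PreparedStatement ps = conn.prepareStatement(INSERT_SQL)) {
            ps.setString(1, id);
            ps.setString(2, payload);
            ps.setTimestamp(3, ts);
            ps.executeUpdate();
        }
    }
}
```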

2 answers

It is inefficient to use individual INSERT statements with Amazon Redshift. It is designed as a data warehouse, providing very fast SQL queries. It is not a transactional database where data is frequently updated and inserted.

The best practice is to load in batches (or micro-batches) using the COPY command. Kinesis Firehose uses this method. It is much more efficient because multiple nodes load the data in parallel.
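As a sketch of the batch path, the COPY statement a loader would issue over JDBC might look like the one built below. The table name, S3 URI, and IAM role ARN are placeholder assumptions:

```java
public class CopyCommandBuilder {
    // Builds a Redshift COPY statement that loads a batch file from S3.
    // The caller supplies table, S3 location, and IAM role; format is JSON here
    // for illustration -- CSV and other formats are equally valid.
    static String buildCopy(String table, String s3Uri, String iamRoleArn) {
        return "COPY " + table
             + " FROM '" + s3Uri + "'"
             + " IAM_ROLE '" + iamRoleArn + "'"
             + " FORMAT AS JSON 'auto'";
    }

    public static void main(String[] args) {
        String sql = buildCopy("events",
                "s3://my-bucket/batches/batch-0001.json",
                "arn:aws:iam::123456789012:role/RedshiftCopyRole");
        System.out.println(sql);
    }
}
```

A single COPY of one large file (or a manifest of files) replaces thousands of individual INSERT transactions.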

If you are serious about real-time data processing, then Amazon Redshift might not be the best database. Consider a traditional SQL database (such as those provided by Amazon RDS), a NoSQL database (such as Amazon DynamoDB), or even Elasticsearch. You should use Redshift only if your focus is on reporting over large amounts of data, typically involving many joins across tables.

As mentioned in the Amazon Redshift best practices for loading data:

If the COPY command is not an option and you require SQL inserts, use multi-row inserts whenever possible. Data compression is inefficient when you add data only one row or a few rows at a time.
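A multi-row insert of the kind the quoted best practice recommends can be sketched as a statement builder that emits one placeholder group per row. Table and column names here are the caller's; nothing is Redshift-specific:

```java
import java.util.Collections;
import java.util.List;
import java.util.StringJoiner;

public class MultiRowInsert {
    // Builds a single multi-row INSERT, e.g.
    // INSERT INTO t (a, b) VALUES (?, ?), (?, ?)
    // so many rows are committed in one statement instead of one per transaction.
    static String buildMultiRowInsert(String table, List<String> columns, int rowCount) {
        String cols = String.join(", ", columns);
        String group = "(" + String.join(", ",
                Collections.nCopies(columns.size(), "?")) + ")";
        StringJoiner values = new StringJoiner(", ");
        for (int i = 0; i < rowCount; i++) {
            values.add(group);
        }
        return "INSERT INTO " + table + " (" + cols + ") VALUES " + values;
    }
}
```

The resulting string can be fed to a JDBC PreparedStatement with `columns.size() * rowCount` bound parameters.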


The best option is Kinesis Firehose, which works on batches of events. You write events to Firehose one after another, and it delivers them in an optimal way based on your configuration: you can set the buffer interval in minutes or the buffer size in MB. A single event may reach Redshift faster with INSERT, but that method does not scale, while COPY is designed to work at almost any scale.
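The buffering behaviour this answer describes can be sketched as a micro-batcher that flushes when either a size or an age threshold is crossed, whichever comes first. The thresholds below are illustrative; real Firehose buffers are configured in MB and seconds, and it writes each flushed batch to S3 before issuing COPY:

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

public class MicroBatcher {
    private final int maxBytes;
    private final long maxAgeMillis;
    private final List<String> buffer = new ArrayList<>();
    private int bufferedBytes = 0;
    private long firstEventAt = -1;

    MicroBatcher(int maxBytes, long maxAgeMillis) {
        this.maxBytes = maxBytes;
        this.maxAgeMillis = maxAgeMillis;
    }

    // Returns the flushed batch when a threshold is crossed, or null otherwise.
    List<String> add(String record, long nowMillis) {
        if (firstEventAt < 0) firstEventAt = nowMillis;
        buffer.add(record);
        bufferedBytes += record.getBytes(StandardCharsets.UTF_8).length;
        boolean sizeHit = bufferedBytes >= maxBytes;
        boolean ageHit = nowMillis - firstEventAt >= maxAgeMillis;
        if (sizeHit || ageHit) {
            List<String> batch = new ArrayList<>(buffer);
            buffer.clear();
            bufferedBytes = 0;
            firstEventAt = -1;
            return batch; // caller would write this batch to S3 and COPY it
        }
        return null;
    }
}
```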


Source: https://habr.com/ru/post/1014088/


All Articles