EC2 provides a very convenient, scalable, on-demand mechanism for running distributed (parallel) processing, and S3 provides a reliable storage service.
I am trying to use EC2 nodes for an ETL and analytics process. The process needs a lot of data (100 GB to 1 TB), which arrives quickly (several times a day), and substantial computing resources that are only needed for a short duration.
The above design needs:
- A high-speed connection between S3 and EC2.
- The S3-to-EC2 connection should also be reliable, since scheduling startup, transferring the data, running the processes, and terminating the nodes must all happen as quickly as possible, not only to save costs but also because SLAs are involved.
But currently:
- The only way to pull data from S3 appears to be over HTTP, so the transfer is bounded by the download limits of the EC2 nodes.
- In addition, the data ingestion goes over the Internet and can therefore be unreliable for strict scheduling purposes, which means adequate buffer time has to be allowed for every task.
In a private data center installation, you could set up a faster dedicated link (for example, 10 Gbps) between the storage and the physical nodes.
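
To put rough numbers on why the link speed matters for scheduling, here is a back-of-the-envelope sketch. It assumes decimal units and a link that is fully utilized with no protocol overhead, which is optimistic. The 1 Gbps figure is only a hypothetical comparison point, not a measured EC2 limit, and `transfer_seconds` is a helper name made up for this illustration.

```python
def transfer_seconds(size_bytes: float, link_bits_per_second: float) -> float:
    """Ideal time to move size_bytes over a fully utilized link (no overhead)."""
    return size_bytes * 8 / link_bits_per_second


# Decimal units (assumption): 1 GB = 10^9 bytes, 1 Gbps = 10^9 bits per second.
GB = 10**9
TB = 10**12
GBPS = 10**9

# Data sizes from the question, against a hypothetical 1 Gbps link
# and the 10 Gbps dedicated link mentioned above.
for size_label, size in (("100 GB", 100 * GB), ("1 TB", 1 * TB)):
    for rate_label, rate in (("1 Gbps", 1 * GBPS), ("10 Gbps", 10 * GBPS)):
        minutes = transfer_seconds(size, rate) / 60
        print(f"{size_label} over {rate_label}: {minutes:.1f} min")
```

Under these idealized assumptions, 1 TB takes about 13 minutes at 10 Gbps but over two hours at 1 Gbps, so the transfer alone can dominate a short-lived job that runs several times a day.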
Are there any service options or configurations in AWS that would satisfy the above requirements?