Data locality is very important with MapReduce and HDFS (the same goes for Spark and HBase). I have been studying AWS and see two options when deploying a cluster in the cloud: keep the data in HDFS on the cluster itself, or store it in S3 and bring it to the cluster for processing.
The second option seems more attractive for several reasons, the most interesting being the ability to scale storage and compute separately and to shut the processing cluster down when it is not needed (or rather, to spin it up only when necessary). That alone is a good example of the benefits of using S3.
What bothers me is the problem of data locality. If the data is stored in S3, it has to be pulled into HDFS (or read over the network) every time a job starts. My question is: how big a problem is this, and is the approach still worth it?
What comforts me is that the data only has to be fetched from S3 once, at the start; after that, the following jobs will find their intermediate results locally in HDFS.
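To make the workflow concrete, here is a minimal PySpark sketch of what I have in mind; the bucket, paths, and column names are made up, and I have not run this on a real cluster:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-to-hdfs-example")
         .getOrCreate())

# The first (and only) read from S3 -- the step with no data locality.
# Bucket and path are hypothetical.
events = spark.read.parquet("s3a://my-bucket/raw/events/")

# Persist a working copy in the cluster's HDFS so later jobs read locally.
events.write.mode("overwrite").parquet("hdfs:///data/events/")

# Subsequent stages work against the local HDFS copy.
# The "event_date" column is just an example.
local_events = spark.read.parquet("hdfs:///data/events/")
daily_counts = local_events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("hdfs:///data/daily_counts/")

spark.stop()
```

So only the first stage pays the S3 transfer cost; everything after it would run against HDFS on the cluster.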
I hope to get an answer from someone with practical experience with this. Thanks.