Data locality is very important with MapReduce and HDFS (the same goes for Spark and HBase). I have been studying AWS and see two options when deploying a cluster in the cloud: keep the data in HDFS on the cluster itself, or store it in S3 and bring it to the cluster for processing.
The second option seems more attractive for several reasons, the most interesting being the ability to scale storage and compute separately and to shut the processing cluster down when it is not needed (or rather, to spin it up only when necessary). That alone is a good example of the benefits of using S3.
What bothers me is the problem of data locality. If the data is stored in S3, it has to be pulled into HDFS (or read over the network) every time a job starts. My question is: how big a problem is this, and is the approach still worth it?
What comforts me is that the data only has to be fetched from S3 once, at the start; after that, the following jobs will find their intermediate results locally in HDFS.
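To make the workflow concrete, here is a minimal PySpark sketch of what I have in mind; the bucket, paths, and column names are made up, and I have not run this on a real cluster:

```python
from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("s3-to-hdfs-example")
         .getOrCreate())

# The first (and only) read from S3 -- the step with no data locality.
# Bucket and path are hypothetical.
events = spark.read.parquet("s3a://my-bucket/raw/events/")

# Persist a working copy in the cluster's HDFS so later jobs read locally.
events.write.mode("overwrite").parquet("hdfs:///data/events/")

# Subsequent stages work against the local HDFS copy.
# The "event_date" column is just an example.
local_events = spark.read.parquet("hdfs:///data/events/")
daily_counts = local_events.groupBy("event_date").count()
daily_counts.write.mode("overwrite").parquet("hdfs:///data/daily_counts/")

spark.stop()
```

So only the first stage pays the S3 transfer cost; everything after it would run against HDFS on the cluster.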
I hope to get an answer from someone with practical experience with this. Thanks.