Many of the problems are not Parquet-specific; they come from the fact that S3 is not a filesystem, even though the APIs try to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
Regarding the JIRAs:
- HADOOP-11694: S3A phase II, which is everything you will get in Hadoop 2.8. Much of this is already in HDP 2.5, and yes, it has significant benefits.
- HADOOP-13204: the list of tasks that come after that.
- As for Spark (and Hive), the use of rename() to commit work is a killer. It is used at the end of tasks and jobs, and in checkpointing. The more output you generate, the longer things take to complete. The S3Guard work will include a zero-rename committer, but it will take care and time to move things over to it.
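To make the rename() cost concrete, here is a toy sketch of my own, not S3A code. It assumes the usual object-store emulation of a directory rename: list the source prefix, then copy and delete each object. The class and function names are hypothetical, and real clients differ in the details (paged listings, bulk deletes, and copy time that grows with the amount of data), so treat the counts as illustrative only.

```python
class CountingObjectStore:
    """Toy in-memory object store that counts the requests it would issue."""

    def __init__(self):
        self.objects = {}   # key -> bytes
        self.requests = 0   # stand-in for HTTPS calls made to a real store

    def put(self, key, data):
        self.requests += 1
        self.objects[key] = data

    def list_prefix(self, prefix):
        self.requests += 1
        return sorted(k for k in self.objects if k.startswith(prefix))

    def copy(self, src, dst):
        self.requests += 1
        self.objects[dst] = self.objects[src]

    def delete(self, key):
        self.requests += 1
        del self.objects[key]

    def rename_prefix(self, src, dst):
        """Emulated directory rename: one list, then copy + delete per object."""
        for key in self.list_prefix(src):
            self.copy(key, dst + key[len(src):])
            self.delete(key)


def requests_for_rename(num_files):
    """Count the requests needed to 'commit' num_files output files by rename."""
    store = CountingObjectStore()
    for i in range(num_files):
        store.put(f"out/_temporary/part-{i:05d}", b"x")
    before = store.requests
    store.rename_prefix("out/_temporary/", "out/")
    return store.requests - before


if __name__ == "__main__":
    for n in (10, 100, 1000):
        print(n, "files ->", requests_for_rename(n), "requests")
```

In this model the request count grows linearly with the number of output files, which is why bigger jobs take longer to commit. On a real filesystem the same rename is a single metadata operation.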
Parquet? Pushdown works, but there are a few other options to speed things up. I list them and others at: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
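As a rough illustration of the kind of options meant here, the sketch below collects a few Spark settings that are commonly suggested for Parquet on object stores. The selection is my assumption, not a copy of the slides, and defaults vary between Spark versions, so check each one against the documentation for the release you run. The helper `to_submit_args` is a hypothetical convenience that just renders the dictionary as `spark-submit` arguments.

```python
# Assumed settings for illustration; verify against your Spark/Hadoop versions.
OBJECT_STORE_CONF = {
    # Push filters down into the Parquet reader so less data is fetched.
    "spark.sql.parquet.filterPushdown": "true",
    # Avoid reading every file's footer to merge schemas.
    "spark.sql.parquet.mergeSchema": "false",
    # Skip writing Parquet summary metadata files at the end of a job.
    "spark.hadoop.parquet.enable.summary-metadata": "false",
    # Prune partitions in the metastore instead of listing them all.
    "spark.sql.hive.metastorePartitionPruning": "true",
}


def to_submit_args(conf):
    """Render a settings dict as a flat list of spark-submit --conf arguments."""
    args = []
    for key in sorted(conf):
        args += ["--conf", f"{key}={conf[key]}"]
    return args


if __name__ == "__main__":
    print(" ".join(to_submit_args(OBJECT_STORE_CONF)))
```

None of these remove the rename() cost at commit time; they only cut down on the extra reads and writes around the Parquet files themselves.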