Parquet support as an input / output format when working with S3

I have seen a number of questions describing problems when working with S3 in Spark, many of them specifically about Parquet files, as well as some external sources covering other problems with the Spark + S3 + Parquet combination. It makes me think that either S3 with Spark, or this full combination, may not be the best choice.

Am I onto something here? Can anyone give an authoritative answer explaining:

  • Current status of Parquet support with an emphasis on S3.
  • Can Spark (SQL) take full advantage of Parquet features such as partition pruning, predicate pushdown (including for deeply nested schemas), and Parquet metadata? Do all of these features work as expected on S3 (or compatible storage solutions)?
  • Ongoing developments and open JIRA tickets.
  • Are there any configuration options to be aware of when using these three together?
1 answer

Many of the problems are not Parquet-specific; the underlying issue is that S3 is not a file system, even though the APIs try to make it look like one. Many nominally low-cost operations take multiple HTTPS requests, with the consequent delays.
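To make that cost concrete, here is a minimal sketch using the Hadoop FileSystem API (the bucket and paths are hypothetical): on HDFS, rename() is a cheap metadata operation, but on S3A it has to be emulated, so its cost grows with the amount of data being moved.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Hypothetical bucket and paths, for illustration only.
val fs = FileSystem.get(new URI("s3a://my-bucket/"), new Configuration())

// On a real file system this is an O(1) metadata update; on S3A it is
// implemented as copy-then-delete, i.e. several HTTPS requests per object
// and time proportional to the size of the data being "renamed".
fs.rename(new Path("s3a://my-bucket/tmp/part-00000"),
          new Path("s3a://my-bucket/output/part-00000"))
```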

Regarding the JIRAs:

  • HADOOP-11694: S3A phase II, which is everything you get in Hadoop 2.8. Much of it is already in HDP 2.5, and yes, it brings significant benefits.
  • HADOOP-13204: the follow-on list of tasks still to be done.
  • As for Spark (and Hive), using rename() to commit work is a killer. It is used at the end of tasks and jobs, as well as in checkpointing; the more output you create, the longer completion takes. The S3Guard work will include a zero-rename committer, but moving things onto it will take care and time. In the meantime, the committer settings sketched after this list can soften the cost.
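
As an illustration of the committer point above, a hedged sketch (the app name and output path are made up): Hadoop's v2 file output committer renames task output directly into the destination at task commit, skipping the second rename pass at job commit. It reduces, but does not eliminate, the rename cost, and it is not the zero-rename committer mentioned above.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: the app name and s3a:// path are hypothetical.
val spark = SparkSession.builder()
  .appName("s3a-commit-sketch")
  // v2 committer: task output is renamed straight into the final
  // destination at task commit, avoiding the job-commit rename pass.
  // This reduces (but does not remove) rename costs on S3A.
  .config("spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version", "2")
  .getOrCreate()

// Every file written here still goes through at least one rename.
spark.range(1000).toDF("id").write.parquet("s3a://my-bucket/output/")
```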

Parquet? Pushdown works, but there are several other options to speed things up. I list them, and more, at: http://www.slideshare.net/steve_l/apache-spark-and-object-stores
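For a feel of the kind of options involved, a minimal read-side sketch (the dataset path is hypothetical, and fs.s3a.experimental.input.fadvise requires Hadoop 2.8 or later):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder()
  .appName("parquet-s3a-read-sketch")
  // Push filters down into the Parquet reader (on by default in recent Spark).
  .config("spark.sql.parquet.filterPushdown", "true")
  // Skip schema merging across files unless you actually need it.
  .config("spark.sql.parquet.mergeSchema", "false")
  // Hadoop 2.8+ S3A: bias the input stream towards the seek-heavy
  // access pattern of columnar formats rather than sequential reads.
  .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
  .getOrCreate()

// Column pruning and predicate pushdown mean only the needed columns
// and row groups are fetched from S3, not whole files.
val events = spark.read.parquet("s3a://my-bucket/events/") // hypothetical path
events.filter(col("year") === 2016).select("id", "value").show()
```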

