Spark Performance Improvements from Sorting Data Before Writing Parquet Files

Filter queries will run faster if the DataFrame is sorted before being saved as Parquet files.

Suppose we have the following peopleDf DataFrame (pretend this is a sample, and the real one has 20 billion rows):

+-----+----------------+
| age | favorite_color |
+-----+----------------+
|  54 | blue           |
|  10 | black          |
|  13 | blue           |
|  19 | red            |
|  89 | blue           |
+-----+----------------+

Let's write out sorted and unsorted versions of this DataFrame as Parquet files.

peopleDf.write.parquet("s3a://some-bucket/unsorted/")
peopleDf.sort($"favorite_color").write.parquet("s3a://some-bucket/sorted/")

Is there any performance gain when reading the sorted data and filtering on favorite_color?

val pBlue1 = spark.read.parquet("s3a://some-bucket/unsorted/").filter($"favorite_color" === "blue")

// is this faster?

val pBlue2 = spark.read.parquet("s3a://some-bucket/sorted/").filter($"favorite_color" === "blue")
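One way to check is to compare the physical plans of the two reads (a sketch, using the pBlue1 and pBlue2 DataFrames defined above; the exact explain() output format varies by Spark version):

```scala
// Both plans should show the filter pushed down to the Parquet scan, e.g.
//   PushedFilters: [IsNotNull(favorite_color), EqualTo(favorite_color,blue)]
// The pushed filter is only *effective* on the sorted files, where each row
// group's min/max statistics can rule entire groups in or out.
pBlue1.explain()
pBlue2.explain()
```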
1 answer

Sorting provides several advantages:

  • Parquet stores min/max statistics for each row group. When the data is sorted on favorite_color, all rows with a given value are clustered into a few row groups, so a pushed-down filter like favorite_color === "blue" lets the reader skip every row group whose min/max range cannot match.
  • Sorted data compresses better, because run-length and dictionary encoding work best on long runs of identical values, so the files are smaller and less data is read from S3.

So yes, reading the sorted data and filtering on favorite_color should be faster, provided the filter is pushed down to the Parquet reader:
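A rough way to measure the difference is to time an action over both filtered reads (a sketch; the time helper is hypothetical, not part of the original answer, and actual timings depend on the cluster and data layout):

```scala
// Hypothetical timing helper: runs a block and prints its wall-clock duration.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

// On the 20-billion-row dataset, the sorted read should complete noticeably
// faster, since most row groups are skipped via their min/max statistics.
time("unsorted") {
  spark.read.parquet("s3a://some-bucket/unsorted/").filter($"favorite_color" === "blue").count()
}
time("sorted") {
  spark.read.parquet("s3a://some-bucket/sorted/").filter($"favorite_color" === "blue").count()
}
```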


Source: https://habr.com/ru/post/1660640/
