Spark Performance Improvements from Sorting Data Before Writing Parquet Files

Filter queries will run faster if the DataFrame is sorted before being saved as Parquet files.

Suppose we have the following peopleDf DataFrame (pretend this is a sample, and the real one has 20 billion rows):

+-----+----------------+
| age | favorite_color |
+-----+----------------+
|  54 | blue           |
|  10 | black          |
|  13 | blue           |
|  19 | red            |
|  89 | blue           |
+-----+----------------+

Let's write out sorted and unsorted versions of this DataFrame as Parquet files.

peopleDf.write.parquet("s3a://some-bucket/unsorted/")
peopleDf.sort($"favorite_color").write.parquet("s3a://some-bucket/sorted/")

Is there any performance gain when reading the sorted data and filtering on favorite_color?

val pBlue1 = spark.read.parquet("s3a://some-bucket/unsorted/").filter($"favorite_color" === "blue")

// is this faster?

val pBlue2 = spark.read.parquet("s3a://some-bucket/sorted/").filter($"favorite_color" === "blue")
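One way to check is to compare the physical plans of the two reads (a sketch, using the pBlue1 and pBlue2 DataFrames defined above; the exact explain() output format varies by Spark version):

```scala
// Both plans should show the filter pushed down to the Parquet scan, e.g.
//   PushedFilters: [IsNotNull(favorite_color), EqualTo(favorite_color,blue)]
// The pushed filter is only *effective* on the sorted files, where each row
// group's min/max statistics can rule entire groups in or out.
pBlue1.explain()
pBlue2.explain()
```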
1 answer

Sorting provides several advantages:

  • Parquet stores min/max statistics for each row group. When the data is sorted on favorite_color, all rows with a given value are clustered into a few row groups, so a pushed-down filter like favorite_color === "blue" lets the reader skip every row group whose min/max range cannot match.
  • Sorted data compresses better, because run-length and dictionary encoding work best on long runs of identical values, so the files are smaller and less data is read from S3.

So yes, reading the sorted data and filtering on favorite_color should be faster, provided the filter is pushed down to the Parquet reader:
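A rough way to measure the difference is to time an action over both filtered reads (a sketch; the time helper is hypothetical, not part of the original answer, and actual timings depend on the cluster and data layout):

```scala
// Hypothetical timing helper: runs a block and prints its wall-clock duration.
def time[T](label: String)(block: => T): T = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

// On the 20-billion-row dataset, the sorted read should complete noticeably
// faster, since most row groups are skipped via their min/max statistics.
time("unsorted") {
  spark.read.parquet("s3a://some-bucket/unsorted/").filter($"favorite_color" === "blue").count()
}
time("sorted") {
  spark.read.parquet("s3a://some-bucket/sorted/").filter($"favorite_color" === "blue").count()
}
```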


Source: https://habr.com/ru/post/1660640/
