My data is basically a table that contains an ID column and a GROUP_ID column, among other "data".
In the first step, I read the CSV in Spark, do some processing to prepare the data for the second step, and write the data as parquet. The second step consists of a lot of groupBy('GROUP_ID') and Window.partitionBy('GROUP_ID').orderBy('ID') operations.
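A minimal sketch of the two steps, just to make the setup concrete; the paths, the preparation step, and the example aggregation/window expression are placeholders, not my actual job:

    from pyspark.sql import SparkSession, Window, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Step 1: read CSV, prepare the data, write parquet
    df = spark.read.csv('/path/to/csv', header=True, inferSchema=True)
    # ... preparation ...
    df.write.parquet('/path/to/parquet')

    # Step 2: lots of groupBy('GROUP_ID') and Window.partitionBy('GROUP_ID').orderBy('ID')
    df2 = spark.read.parquet('/path/to/parquet')
    counts = df2.groupBy('GROUP_ID').count()              # placeholder aggregation
    w = Window.partitionBy('GROUP_ID').orderBy('ID')
    df2 = df2.withColumn('prev_id', F.lag('ID').over(w))  # placeholder window function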
The goal now is to avoid shuffling in the second step by loading the data efficiently in the first step, since that is a one-time cost.
Question Part 1: AFAIK, Spark preserves the partitioning when loading from parquet (which is in fact the basis of any "optimized write" consideration) - correct?
I came up with three possibilities:
1. df.orderBy('ID').write.partitionBy('GROUP_ID').parquet('/path/to/parquet')
2. df.orderBy('ID').repartition(n, 'GROUP_ID').write.parquet('/path/to/parquet')
3. df.repartition(n, 'GROUP_ID').sortWithinPartitions('ID').write.parquet('/path/to/parquet')
I would set n so that the individual parquet files are ~ 100 MB.
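A rough sketch of how I would pick n, assuming the total size of the dataset on disk is known (the 50 GB figure below is a made-up placeholder, not a measurement), applied to option 3 from the list above:

    import math

    TARGET_FILE_BYTES = 100 * 1024 ** 2            # ~100 MB per output file
    estimated_total_bytes = 50 * 1024 ** 3         # placeholder estimate of the dataset size

    n = max(1, math.ceil(estimated_total_bytes / TARGET_FILE_BYTES))

    (df.repartition(n, 'GROUP_ID')
       .sortWithinPartitions('ID')
       .write.parquet('/path/to/parquet'))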
Question Part 2: Is it right that the three options give "the same"/similar results with respect to the goal (avoiding shuffling in the second step)? If not, what is the difference? And which one is "better"?
Question Part 3: Which of the three options works best with respect to step 1?
Thank you for sharing your knowledge!
EDIT 2017-07-24
After performing some tests (writing to and reading from parquet), it seems that Spark does not restore the partitionBy and orderBy information by default in the second step. The number of partitions (as obtained from df.rdd.getNumPartitions()) is apparently determined by the number of cores and/or by spark.default.parallelism (if set), but not by the number of parquet partitions. So the answer to question 1 appears to be WRONG, and questions 2 and 3 become irrelevant.
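A sketch of the check I ran: read the parquet back and look at the partition count, which in my tests seemed to follow the cluster defaults rather than the layout written in step 1.

    df2 = spark.read.parquet('/path/to/parquet')
    print(df2.rdd.getNumPartitions())             # followed cores / spark.default.parallelism,
                                                  # not the number of parquet partitions
    print(spark.sparkContext.defaultParallelism)  # compare against the cluster default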
So it turns out that the REAL QUESTION is: is there any way to tell Spark that the data is already partitioned by column X and sorted by column Y?