We use Spark 2.0.2 (PySpark) to partition and sort billions of events for further processing. The events are partitioned by user and, within each partition, sorted by timestamp. They are stored in Avro format. The downstream processing is also a Spark (PySpark) application, and it should benefit from this partitioning and sorting.
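For context, here is roughly what our write job does (a minimal sketch; the column names `user_id` and `event_time` and the paths are placeholders for our actual schema; Spark 2.0.2 needs the external spark-avro package, e.g. `--packages com.databricks:spark-avro_2.11:3.1.0`, for Avro I/O):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-and-sort-events").getOrCreate()

    events = spark.read.format("com.databricks.spark.avro").load("/data/raw/events")

    (events
        .repartition("user_id")              # one shuffle: co-locate each user's events
        .sortWithinPartitions("event_time")  # sort inside each partition, no further shuffle
        .write
        .format("com.databricks.spark.avro")
        .save("/data/events/partitioned"))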
I want to know how a downstream application can tell Spark that the data it loads (as an RDD or DataFrame) is already partitioned by user and sorted by timestamp within each partition. Even if I specify the partitioning and sort order again, I assume Spark will still shuffle and sort, because it knows nothing about the existing layout of the loaded data. With billions of events that can be very expensive, and I want to avoid it. How can I achieve this?
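To illustrate the problem, here is a sketch of the downstream read (same placeholder names as above). Any per-user, time-ordered operation, shown here with a hypothetical window using `lag`, makes Spark plan an Exchange plus Sort, because the loaded DataFrame carries no metadata saying it is already partitioned by `user_id` and sorted by `event_time`:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("downstream-processing").getOrCreate()

    events = spark.read.format("com.databricks.spark.avro").load("/data/events/partitioned")

    # A per-user, time-ordered computation, e.g. the gap to the previous event.
    w = Window.partitionBy("user_id").orderBy("event_time")
    with_prev = events.withColumn("prev_time", F.lag("event_time").over(w))

    # The physical plan shows Exchange hashpartitioning(user_id, ...) and Sort,
    # i.e. a full re-shuffle and re-sort of data that is already laid out this way.
    with_prev.explain()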
Thanks - Rupesh