How to load already partitioned and sorted data in Apache Spark

We use Spark 2.0.2 (PySpark) to partition and sort billions of events for further processing. Events are partitioned by user and, within each partition, sorted by timestamp. The events are stored in Avro format. The downstream processing is also a Spark application (PySpark) and should benefit from this partitioning and sorting.
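For concreteness, here is a minimal sketch of the write side as I understand it, in PySpark. The column names `user_id` and `timestamp`, the paths, and the raw input format are placeholders rather than our real schema; on Spark 2.0.2, Avro support comes from the external spark-avro package (com.databricks:spark-avro_2.11):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("write-events").getOrCreate()

# Hypothetical raw input; the real source and schema differ.
events = spark.read.json("/data/raw_events")

(events
    .repartition("user_id")             # shuffle once: co-locate each user's events
    .sortWithinPartitions("timestamp")  # sort inside each partition by time
    .write
    .format("com.databricks.spark.avro")  # Avro via the external spark-avro package
    .save("/data/events_by_user"))        # placeholder output path
```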

I want to know how the downstream application can tell Spark that the data it loads (as an RDD / DataFrame) is already partitioned and sorted within each partition. Even if I specify the partitioning and in-partition sorting myself, I assume Spark will shuffle and sort anyway, because it does not know the layout of the loaded data. With billions of events this can be very expensive, and I want to avoid it. How can I achieve this?
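To illustrate the problem, here is a sketch using the same placeholder names as above. After the Avro files are read back, Spark has no metadata about the existing layout, so even a simple per-user aggregation plans a full shuffle, visible as an Exchange node in the physical plan:

```python
events = (spark.read
    .format("com.databricks.spark.avro")
    .load("/data/events_by_user"))

# Spark does not know the files are already grouped by user, so this
# plans an Exchange (hash repartitioning on user_id) before aggregating.
per_user_counts = events.groupBy("user_id").count()
per_user_counts.explain()
```

For what it is worth, `DataFrameWriter.bucketBy` / `sortBy` with `saveAsTable` records exactly this kind of layout in the metastore so that later reads can skip the shuffle, but as far as I can tell that API only reached PySpark in Spark 2.3, so it does not seem to be an option on 2.0.2.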

Thanks - Rupesh

