We use Spark 2.0.2 (PySpark) to partition and sort billions of events for further processing. The events are partitioned by user and, within each partition, sorted by timestamp. They are stored in Avro format. The downstream processing is also a Spark (PySpark) application, and it should benefit from this partitioning and sorting.
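For context, here is roughly what our write job does (a minimal sketch; the column names `user_id` and `event_time` and the paths are placeholders for our actual schema; Spark 2.0.2 needs the external spark-avro package, e.g. `--packages com.databricks:spark-avro_2.11:3.1.0`, for Avro I/O):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("partition-and-sort-events").getOrCreate()

    events = spark.read.format("com.databricks.spark.avro").load("/data/raw/events")

    (events
        .repartition("user_id")              # one shuffle: co-locate each user's events
        .sortWithinPartitions("event_time")  # sort inside each partition, no further shuffle
        .write
        .format("com.databricks.spark.avro")
        .save("/data/events/partitioned"))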
I want to know how a downstream application can tell Spark that the data it loads (as an RDD or DataFrame) is already partitioned by user and sorted by timestamp within each partition. Even if I specify the partitioning and sort order again, I assume Spark will still shuffle and sort, because it knows nothing about the existing layout of the loaded data. With billions of events that can be very expensive, and I want to avoid it. How can I achieve this?
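To illustrate the problem, here is a sketch of the downstream read (same placeholder names as above). Any per-user, time-ordered operation, shown here with a hypothetical window using `lag`, makes Spark plan an Exchange plus Sort, because the loaded DataFrame carries no metadata saying it is already partitioned by `user_id` and sorted by `event_time`:

    from pyspark.sql import SparkSession, Window
    from pyspark.sql import functions as F

    spark = SparkSession.builder.appName("downstream-processing").getOrCreate()

    events = spark.read.format("com.databricks.spark.avro").load("/data/events/partitioned")

    # A per-user, time-ordered computation, e.g. the gap to the previous event.
    w = Window.partitionBy("user_id").orderBy("event_time")
    with_prev = events.withColumn("prev_time", F.lag("event_time").over(w))

    # The physical plan shows Exchange hashpartitioning(user_id, ...) and Sort,
    # i.e. a full re-shuffle and re-sort of data that is already laid out this way.
    with_prev.explain()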
Thanks - Rupesh