Suppose I create an RDD like this (I'm using PySpark):
list_rdd = sc.parallelize(xrange(0, 20, 2), 6)
then I print the partitioned elements with the glom() method and get
[[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]]
How did Spark decide how to split my list? Where does this particular arrangement of elements come from? It could just as well have grouped them differently, leaving elements other than 0 and 10 as singletons while still producing the 6 requested partitions. On a second run, the partitions come out the same.
Using a wider range, with 15 elements, I get partitions in a repeating pattern of two elements followed by three elements:
list_rdd = sc.parallelize(xrange(0, 30, 2), 6)
[[0, 2], [4, 6, 8], [10, 12], [14, 16, 18], [20, 22], [24, 26, 28]]
Using a smaller range of 5 elements, I get
list_rdd = sc.parallelize(xrange(0, 10, 2), 6)
[[], [0], [2], [4], [6], [8]]
So my conclusion is that Spark splits the list into a pattern where a smallest-possible partition is followed by one or more larger ones, and that pattern repeats.
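For what it's worth, all three outputs above are consistent with a simple integer-division slicing rule, where partition i of a sequence of length n gets the elements from index i*n // numSlices up to (i+1)*n // numSlices. Here is a sketch that reproduces my observed glom() results; the function slice_like_spark is my own name, and I'm only assuming this is how Spark computes the partition boundaries:

```python
def slice_like_spark(seq, num_slices):
    # Assumed rule: partition i covers the half-open index range
    # [i*n // num_slices, (i+1)*n // num_slices).
    n = len(seq)
    return [seq[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# Reproduces the three glom() outputs above:
print(slice_like_spark(list(range(0, 20, 2)), 6))
# [[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]]
print(slice_like_spark(list(range(0, 30, 2)), 6))
# [[0, 2], [4, 6, 8], [10, 12], [14, 16, 18], [20, 22], [24, 26, 28]]
print(slice_like_spark(list(range(0, 10, 2)), 6))
# [[], [0], [2], [4], [6], [8]]
```

So the "small partition, then larger ones" pattern would just be the remainder of the integer division being spread across the slices.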
My question is: is there a reason for this particular choice? Is it just an elegant scheme, or does it also bring performance benefits?