Suppose I create an RDD like this (I'm using PySpark):
list_rdd = sc.parallelize(xrange(0, 20, 2), 6)
then I print the partitioned elements with the glom() method and get
[[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]]
How did Spark decide how to split my list? Where does this particular arrangement of elements come from? It could just as well have grouped them differently, leaving elements other than 0 and 10 as singletons while still producing the 6 requested partitions. On a second run, the partitions come out the same.
Using a wider range, with 15 elements, I get partitions in a repeating pattern of two elements followed by three elements:
list_rdd = sc.parallelize(xrange(0, 30, 2), 6)
[[0, 2], [4, 6, 8], [10, 12], [14, 16, 18], [20, 22], [24, 26, 28]]
Using a smaller range of 5 elements, I get
list_rdd = sc.parallelize(xrange(0, 10, 2), 6)
[[], [0], [2], [4], [6], [8]]
So my conclusion is that Spark splits the list into a pattern where a smallest-possible partition is followed by one or more larger ones, and that pattern repeats.
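For what it's worth, all three outputs above are consistent with a simple integer-division slicing rule, where partition i of a sequence of length n gets the elements from index i*n // numSlices up to (i+1)*n // numSlices. Here is a sketch that reproduces my observed glom() results; the function slice_like_spark is my own name, and I'm only assuming this is how Spark computes the partition boundaries:

```python
def slice_like_spark(seq, num_slices):
    # Assumed rule: partition i covers the half-open index range
    # [i*n // num_slices, (i+1)*n // num_slices).
    n = len(seq)
    return [seq[i * n // num_slices:(i + 1) * n // num_slices]
            for i in range(num_slices)]

# Reproduces the three glom() outputs above:
print(slice_like_spark(list(range(0, 20, 2)), 6))
# [[0], [2, 4], [6, 8], [10], [12, 14], [16, 18]]
print(slice_like_spark(list(range(0, 30, 2)), 6))
# [[0, 2], [4, 6, 8], [10, 12], [14, 16, 18], [20, 22], [24, 26, 28]]
print(slice_like_spark(list(range(0, 10, 2)), 6))
# [[], [0], [2], [4], [6], [8]]
```

So the "small partition, then larger ones" pattern would just be the remainder of the integer division being spread across the slices.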
My question is: is there a reason for this particular choice? Is it just an elegant scheme, or does it also bring performance benefits?