The relationship between Iterable and Array in Spark

I noticed that when I apply mapPartitions to an RDD, each partition is passed to my function as an Iterable object. Inside the function passed to mapPartitions, I call the toArray member function to convert this Iterable to an Array. Does calling toArray copy the data, or does the array just reference the same memory? If it does copy, is there a way to avoid the copy?

1 answer

One important correction to your question: the data structure that exposes the partition during mapPartitions is an Iterator, not an Iterable. Here's the difference in interface:

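  • An Iterator has next() and hasNext() methods, which let you visit each element once. After next() returns an element, the iterator has moved past it (unless you saved it in a variable yourself).
  • An Iterable can produce an Iterator whenever you ask for one. This lets you traverse its elements as many times as you like.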
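The difference can be seen with plain Scala collections, without Spark. This is a minimal sketch; the object and value names are made up for the illustration.

```scala
object IteratorVsIterable {
  val iterable: Iterable[Int] = List(1, 2, 3)

  // An Iterable hands out a fresh Iterator on every call,
  // so it can be traversed more than once.
  val firstPass: Int = iterable.iterator.sum
  val secondPass: Int = iterable.iterator.sum

  // An Iterator is consumed as it is traversed.
  val it: Iterator[Int] = iterable.iterator
  val consumedSum: Int = it.sum
  val exhausted: Boolean = !it.hasNext

  def main(args: Array[String]): Unit =
    println(s"$firstPass $secondPass $consumedSum $exhausted")
}
```

Both passes over the Iterable see all three elements, while the single Iterator has nothing left after it has been summed.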

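In terms of implementation, an Iterator can stream through the data. Only one element has to be in memory at a time, and the following one is loaded when next() is called. When Spark reads a text file (sc.textFile), it streams the records this way instead of loading the whole file first.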

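When you call iterator.toArray, the elements streaming through the iterator are collected into a new array that holds the whole partition in memory. This is not exactly a copy of an existing array, because the data was not necessarily stored as an array before (Spark can also keep blocks in memory). The new array receives either the values themselves (for primitives such as Int) or references to the existing objects (for AnyRef types, such as Array[_]). The referenced objects are not duplicated.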
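Here is a small check of what toArray does with references, again in plain Scala. It is a sketch only: Record is a hypothetical element type invented for this example.

```scala
object ToArrayReferences {
  // Hypothetical element type, used only for this illustration.
  final class Record(val id: Int)

  val records: List[Record] = List(new Record(1), new Record(2))

  // toArray allocates a new array and fills it from the iterator.
  val asArray: Array[Record] = records.iterator.toArray

  // The array slots point at the same objects; the records were not cloned.
  val sameObjects: Boolean =
    records.zip(asArray).forall { case (a, b) => a eq b }

  // For a primitive element type the values themselves are stored.
  val ints: Array[Int] = Iterator(1, 2, 3).toArray

  def main(args: Array[String]): Unit =
    println(s"same objects: $sameObjects, ints: ${ints.mkString(",")}")
}
```

The eq comparison tests reference identity, so sameObjects being true means only the array of references is new, not the records.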

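So, to avoid that cost, do not call toArray unless you really need random access or several passes over the data; work with the iterator directly. Materializing a large partition takes memory and puts extra pressure on the GC, which can hurt performance!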
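As a sketch of how a partition function can avoid materializing the data, the functions below have the shape mapPartitions expects (Iterator in, Iterator out) and never build an array. The names lineLengths, totalLength and pulledAfterOneNext are hypothetical, and the check runs on a local Iterator rather than a real RDD.

```scala
object StreamingPartition {
  // Element-wise transformation: stays lazy, one element at a time.
  def lineLengths(partition: Iterator[String]): Iterator[Int] =
    partition.map(_.length)

  // Per-partition aggregate: folds over the stream without an array.
  def totalLength(partition: Iterator[String]): Iterator[Int] =
    Iterator.single(partition.foldLeft(0)(_ + _.length))

  // Local check of laziness: counts how many source elements were pulled
  // after asking the output iterator for a single element.
  def pulledAfterOneNext(): Int = {
    var pulled = 0
    val source = Iterator("a", "bb", "ccc").map { s => pulled += 1; s }
    val out = lineLengths(source)
    out.next()
    pulled
  }

  def main(args: Array[String]): Unit = {
    println(pulledAfterOneNext())
    println(totalLength(Iterator("a", "bb", "ccc")).next())
  }
}
```

Assuming an RDD[String] named rdd, these could be passed as rdd.mapPartitions(StreamingPartition.lineLengths). If the logic truly needs several passes or indexed access, toArray is still the straightforward choice, at the memory cost described above.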


Source: https://habr.com/ru/post/1664438/

