When to use mapPartitions and mapPartitionsWithIndex?

Question

PySpark documentation describes two functions:

mapPartitions(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 2)
    >>> def f(iterator): yield sum(iterator)
    >>> rdd.mapPartitions(f).collect()
    [3, 7]

AND...

mapPartitionsWithIndex(f, preservesPartitioning=False)
Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.

    >>> rdd = sc.parallelize([1, 2, 3, 4], 4)
    >>> def f(splitIndex, iterator): yield splitIndex
    >>> rdd.mapPartitionsWithIndex(f).sum()
    6

What use cases are these functions meant to address? I do not understand why they would be needed.

apache-spark pyspark

Chris snow Nov 11 '15 at 17:09

1 answer

Mrinal · Accepted Answer · 2017-01-28T15:18:40+0000

To answer this question, we need to compare map with mapPartitions / mapPartitionsWithIndex. (mapPartitions and mapPartitionsWithIndex do pretty much the same thing, except that with mapPartitionsWithIndex you can track which partition is being processed.)
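As a minimal sketch of that difference, the function passed to mapPartitionsWithIndex takes the partition index as an extra first argument. The name tag_with_partition is hypothetical, and the two partitions are simulated locally as plain lists so the snippet runs without a Spark cluster; with a real RDD the call would be rdd.mapPartitionsWithIndex(tag_with_partition).

```python
def tag_with_partition(split_index, iterator):
    # Same shape as the function passed to mapPartitionsWithIndex:
    # (partition index, iterator over that partition's elements).
    for element in iterator:
        yield (split_index, element)

# Assumption: partitions simulated as lists, mirroring how
# sc.parallelize([1, 2, 3, 4], 2) splits the data into [1, 2] and [3, 4].
partitions = [[1, 2], [3, 4]]

tagged = [
    pair
    for index, part in enumerate(partitions)
    for pair in tag_with_partition(index, iter(part))
]
print(tagged)  # [(0, 1), (0, 2), (1, 3), (1, 4)]
```

Dropping the split_index argument gives the shape of a function suitable for plain mapPartitions.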

mapPartitions and mapPartitionsWithIndex are used to optimize the performance of your application. As an illustration, suppose that all the elements in your RDD are XML elements and you need a parser to handle each of them. That means you need an instance of a parser class before you can proceed. You can do this in two ways:

map + foreach: In this case, an instance of the parser class is created for each element, the element is processed, and the instance is then discarded; it is not reused for other elements. So if you are working with an RDD of 12 elements distributed across 4 partitions, the parser is instantiated 12 times. Creating an instance can be an expensive operation, so this takes time.

mapPartitions / mapPartitionsWithIndex: These two methods can mitigate this. They operate on partitions rather than on individual elements (all elements are still processed). With these methods you can instantiate the parser once per partition. Since there are only 4 partitions, the parser is created 4 times (in this example, 8 fewer instantiations than with map). The function that you pass to these methods must accept an iterator, so that it receives all the elements of the partition as its input. Thus, with mapPartitions and mapPartitionsWithIndex, one parser instance is created, all elements of the current partition are processed with it, and the instance is later reclaimed by the GC. When instance creation is costly, this can significantly improve the performance of your application.
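The 12-versus-4 count above can be sketched as follows. CountingParser, parse_one and parse_partition are hypothetical names, the parser is a stand-in built on the standard library's xml.etree.ElementTree, and the 4 partitions are simulated locally as lists so the snippet runs without Spark. With a real RDD the calls would be rdd.map(parse_one) and rdd.mapPartitions(parse_partition).

```python
import xml.etree.ElementTree as ET

class CountingParser:
    """Hypothetical stand-in for a parser that is expensive to construct."""
    instances = 0

    def __init__(self):
        CountingParser.instances += 1

    def parse(self, xml_string):
        return ET.fromstring(xml_string).text

def parse_one(xml_string):
    # Shape of a function passed to rdd.map: a new parser for every element.
    return CountingParser().parse(xml_string)

def parse_partition(iterator):
    # Shape of a function passed to rdd.mapPartitions: one parser per partition,
    # reused for every element the iterator yields.
    parser = CountingParser()
    for xml_string in iterator:
        yield parser.parse(xml_string)

# Assumption: 12 elements in 4 partitions, simulated as 4 lists of 3 strings.
partitions = [[f"<v>{i}</v>" for i in range(p * 3, p * 3 + 3)] for p in range(4)]

CountingParser.instances = 0
via_map = [parse_one(x) for part in partitions for x in part]
map_instances = CountingParser.instances

CountingParser.instances = 0
via_map_partitions = [y for part in partitions for y in parse_partition(iter(part))]
map_partitions_instances = CountingParser.instances

print(map_instances, map_partitions_instances)  # 12 4
```

Both approaches produce the same parsed values; only the number of parser instantiations differs.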

So, the bottom line: whenever some setup work is common to all elements and could be done once and then reused to process all of them, it is better to go with mapPartitions / mapPartitionsWithIndex.

Please find the two links below for explanations with sample code:
https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/
http://apachesparkbook.blogspot.in/2015/11/mappartition-example.html
