To answer this question, we need to compare map with mapPartitions / mapPartitionsWithIndex. (mapPartitions and mapPartitionsWithIndex do essentially the same thing, except that mapPartitionsWithIndex also passes in the index of the partition, so you can track which partition is being processed.)
mapPartitions and mapPartitionsWithIndex are used to optimize the performance of your application. For the sake of illustration, suppose all the elements in your RDD are XML elements and you need a parser to process each of them. You therefore need an instance of a suitable parser class. You can get one in two ways:
map + foreach: In this case, an instance of the parser class is created for each element, the element is processed, and the instance is later destroyed by the GC. The instance is never reused for other elements. So if you are working with an RDD of 12 elements distributed across 4 partitions, the parser is instantiated 12 times. If creating an instance is expensive, as it often is for a parser, this costs time.
mapPartitions / mapPartitionsWithIndex: These two methods address this situation. mapPartitions / mapPartitionsWithIndex work on partitions, not on individual elements (please don't get me wrong, every element is still processed). With these methods you can instantiate the parser once per partition. Since you have only 4 partitions, the parser is created 4 times (in this example, 8 fewer instantiations than with map, i.e. a third as many). The function that you pass to these methods must take an Iterator (so that it receives all the elements of the partition as a single input). Thus, with mapPartitions and mapPartitionsWithIndex, one parser instance is created, all elements of the current partition are processed with it, and the instance is later destroyed by the GC. When instantiation is expensive, this can significantly improve the performance of your application.
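To make the 12-versus-4 count concrete, here is a minimal sketch in plain Python that only simulates the behaviour and does not use Spark. Partitions are modelled as plain lists, and XmlParser, the simulate_* helpers and the parse_* functions are all hypothetical names invented for this illustration. In PySpark the corresponding calls would be rdd.map(f), rdd.mapPartitions(f) and rdd.mapPartitionsWithIndex(f).

```python
from typing import Callable, Iterator, List, Tuple


class XmlParser:
    """Hypothetical stand-in for a parser that is expensive to construct."""

    instances_created = 0

    def __init__(self) -> None:
        XmlParser.instances_created += 1

    def parse(self, element: str) -> str:
        # Trivial "parsing": strip the surrounding <x> and </x> tags.
        return element[3:-4]


# 12 elements spread over 4 partitions (3 elements each), modelled as lists.
partitions: List[List[str]] = [
    [f"<x>{p * 3 + i}</x>" for i in range(3)] for p in range(4)
]


def simulate_map(parts: List[List[str]], f: Callable) -> List[list]:
    # Models rdd.map(f): f is called once per element.
    return [[f(e) for e in part] for part in parts]


def simulate_map_partitions(parts: List[List[str]], f: Callable) -> List[list]:
    # Models rdd.mapPartitions(f): f is called once per partition
    # and receives an iterator over that partition's elements.
    return [list(f(iter(part))) for part in parts]


def simulate_map_partitions_with_index(parts: List[List[str]], f: Callable) -> List[list]:
    # Models rdd.mapPartitionsWithIndex(f): like the above, plus the index.
    return [list(f(i, iter(part))) for i, part in enumerate(parts)]


def parse_one(element: str) -> str:
    parser = XmlParser()  # a new parser for every single element
    return parser.parse(element)


def parse_partition(elements: Iterator[str]) -> Iterator[str]:
    parser = XmlParser()  # one parser, reused for the whole partition
    for e in elements:
        yield parser.parse(e)


def parse_partition_with_index(index: int, elements: Iterator[str]) -> Iterator[Tuple[int, str]]:
    parser = XmlParser()  # one parser per partition; index says which one
    for e in elements:
        yield (index, parser.parse(e))


XmlParser.instances_created = 0
per_element = simulate_map(partitions, parse_one)
map_count = XmlParser.instances_created

XmlParser.instances_created = 0
per_partition = simulate_map_partitions(partitions, parse_partition)
map_partitions_count = XmlParser.instances_created

indexed = simulate_map_partitions_with_index(partitions, parse_partition_with_index)

print(map_count, map_partitions_count)  # prints: 12 4
```

Both approaches produce the same parsed values. Only the number of parser instantiations differs, and the indexed variant additionally tags each result with the partition it came from.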
So, the bottom line: whenever some setup work is common to all elements, and you could do it once and then process all of them, it is better to go with mapPartitions / mapPartitionsWithIndex.
Please see the two links below for an explanation with sample code:
https://bzhangusc.wordpress.com/2014/06/19/optimize-map-performamce-with-mappartitions/
http://apachesparkbook.blogspot.in/2015/11/mappartition-example.html