A spark map is only one task, while it must be parallel (PySpark)

Question

A spark map is only one task, while it must be parallel (PySpark)

I have an RDD with 7M elements with 10 normalized coordinates in each. I also have several centers, and I'm trying to match each record in the closest (Euclidean distance) center. The problem is that this generates only one task, which means that it does not parallelize. This is the form:

def doSomething(point,centers):
    for center in centers.value:
        if(distance(point,center)<1):
             return(center)
    return(None)

preppedData.map(lambda x:doSomething(x,centers)).take(5)

PrepedData RDD is cached and already evaluated, the doSomething function seems a lot easier than it really is, but it's the same principle. Centers is a list that has been broadcast. Why is this map in only one task?

Similar fragments of code in other projects are simply compared with + 100 tasks and run on all artists, this is one task for 1 artist. My work consists of 8 artists with 8 GB and 2 cores for each artist.

+1

parallel-processing apache-spark pyspark

Jan van der Vegt Jan 13 '16 at 12:12

source share

1 answer

Hamel Kothari · Accepted Answer · 2016-01-13T17:22:05+0000

This may be due to the conservative nature of the take () method. See the code in RDD.scala.

, RDD ( RDD , ), , . , , , .

RDD , - , RDD > 5 , . .

, , .

A spark map is only one task, while it must be parallel (PySpark)

More articles: