I have an RDD with 7M elements with 10 normalized coordinates in each. I also have several centers, and I'm trying to match each record in the closest (Euclidean distance) center. The problem is that this generates only one task, which means that it does not parallelize. This is the form:
def doSomething(point,centers):
for center in centers.value:
if(distance(point,center)<1):
return(center)
return(None)
preppedData.map(lambda x:doSomething(x,centers)).take(5)
PrepedData RDD is cached and already evaluated, the doSomething function seems a lot easier than it really is, but it's the same principle. Centers is a list that has been broadcast. Why is this map in only one task?
Similar fragments of code in other projects are simply compared with + 100 tasks and run on all artists, this is one task for 1 artist. My work consists of 8 artists with 8 GB and 2 cores for each artist.
source
share