I tried to execute the following commands in a pyspark session:
>>> a = [1,2,3,4,5,6,7,8,9,10]
>>> da = sc.parallelize(a)
>>> da.reduce(lambda a, b: a + b)
Everything went perfectly and I got the expected answer (55). Now I'm trying to do the same thing, but with a numpy array instead of a Python list:
>>> import numpy
>>> a = numpy.array([1,2,3,4,5,6,7,8,9,10])
>>> da = sc.parallelize(a)
>>> da.reduce(lambda a, b: a + b)
As a result, I get a lot of errors. More specifically, the following error appears several times in the traceback:
ImportError: No module named numpy.core.multiarray
Is it that something is not installed on my cluster, or is pyspark fundamentally unable to work with numpy arrays?
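
To try to narrow it down, I'm thinking of running a small diagnostic job along these lines (just a sketch; try_numpy is a helper name I made up) that attempts to import numpy on each executor rather than on the driver:

>>> def try_numpy(_):
...     # Runs on the executors: report the numpy version if it imports, else None
...     try:
...         import numpy
...         return numpy.__version__
...     except ImportError:
...         return None
...
>>> sc.parallelize(range(4), 4).map(try_numpy).collect()

If this returns None from the workers while import numpy works fine in the driver shell, I'd guess the problem is a missing/mismatched numpy installation on the executor side rather than a fundamental pyspark limitation. Is that the right way to think about it?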