I tried to execute the following commands in a pyspark session:
>>> a = [1,2,3,4,5,6,7,8,9,10]
>>> da = sc.parallelize(a)
>>> da.reduce(lambda a, b: a + b)
Everything went perfectly and I got the expected answer (55). Now I'm trying to do the same thing, but with a numpy array instead of a Python list:
>>> import numpy
>>> a = numpy.array([1,2,3,4,5,6,7,8,9,10])
>>> da = sc.parallelize(a)
>>> da.reduce(lambda a, b: a + b)
As a result, I get a lot of errors. More specifically, the following error appears several times in the traceback:
ImportError: No module named numpy.core.multiarray
Is it that something is not installed on my cluster, or is pyspark fundamentally unable to work with numpy arrays?
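
To try to narrow it down, I'm thinking of running a small diagnostic job along these lines (just a sketch; try_numpy is a helper name I made up) that attempts to import numpy on each executor rather than on the driver:

>>> def try_numpy(_):
...     # Runs on the executors: report the numpy version if it imports, else None
...     try:
...         import numpy
...         return numpy.__version__
...     except ImportError:
...         return None
...
>>> sc.parallelize(range(4), 4).map(try_numpy).collect()

If this returns None from the workers while import numpy works fine in the driver shell, I'd guess the problem is a missing/mismatched numpy installation on the executor side rather than a fundamental pyspark limitation. Is that the right way to think about it?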