Can PySpark work with numpy arrays?

I tried to execute the following commands in a pyspark session:

>>> a = [1,2,3,4,5,6,7,8,9,10]
>>> da = sc.parallelize(a)
>>> da.reduce(lambda a, b: a + b)

Everything went perfectly. I got the expected answer (which is 55). Now I'm trying to do the same, but using numpy arrays instead of Python lists:

>>> import numpy
>>> a = numpy.array([1,2,3,4,5,6,7,8,9,10])
>>> da = sc.parallelize(a)
>>> da.reduce(lambda a, b: a + b)

Instead, I get a wall of errors. More specifically, the following line appears several times in the traceback:

ImportError: No module named numpy.core.multiarray

Is something not installed correctly, or are my cluster and PySpark unable to work with numpy arrays at a fundamental level?

1 answer

I had a similar problem. Running the commands below solved it for me:

pip uninstall numpy
pip install numpy
pip install nose
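
If the ImportError keeps coming back after reinstalling, it is often because the executors run a different Python environment than the driver, so numpy is missing on the worker nodes even though it imports fine locally. A minimal sketch to check which numpy (if any) the executors can import, assuming an active SparkContext named `sc` as in the pyspark shell:

>>> def probe(_):
...     # Try the import on the executor side and report the result.
...     try:
...         import numpy
...         return numpy.__version__
...     except ImportError as e:
...         return "import failed: %s" % e
...
>>> # Run the probe across several partitions and collect the distinct answers.
>>> sc.parallelize(range(8), 8).map(probe).distinct().collect()

If any executor reports an import failure, numpy has to be installed on every worker node (or PYSPARK_PYTHON pointed at an interpreter that has it), not just on the machine where you run the shell.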