I am trying to reproduce the example given here https://www.cloudera.com/documentation/enterprise/5-7-x/topics/spark_python.html for importing external packages into pyspark, but it fails.
My code is:
spark_distro.py
from pyspark import SparkContext, SparkConf

def import_my_special_package(x):
    from external_package import external
    return external.fun(x)

conf = SparkConf()
sc = SparkContext()
int_rdd = sc.parallelize([1, 2, 3, 4])
int_rdd.map(lambda x: import_my_special_package(x)).collect()
external_package.py
class external(object):
    def __init__(self, val):
        self.val = val

    @staticmethod
    def fun(val):
        return val * 3
The spark-submit command:
spark-submit \
  --master yarn \
  /path to script/spark_distro.py \
  --py-files /path to script/external_package.py \
  1000
Actual error:
    vs = list(itertools.islice(iterator, batch))
  File "/home/gsurapur/pyspark_examples/spark_distro.py", line 13, in <lambda>
  File "/home/gsurapur/pyspark_examples/spark_distro.py", line 6, in import_my_special_package
ImportError: No module named external_package
Expected Result:
[3,6,9,12]
I also tried the sc.addPyFile option and it fails with the same ImportError. Please help me find the problem. Thanks in advance.
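In case it helps, this is roughly how the sc.addPyFile attempt looked (the path is the same placeholder as in the spark-submit command above):

from pyspark import SparkContext

sc = SparkContext()
# ship the dependency file to the executors before it is imported inside the map
sc.addPyFile("/path to script/external_package.py")

def import_my_special_package(x):
    from external_package import external
    return external.fun(x)

int_rdd = sc.parallelize([1, 2, 3, 4])
print(int_rdd.map(lambda x: import_my_special_package(x)).collect())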