Submit the PySpark job to the cluster with the '--py-files' argument

I tried to submit the job by passing the GCS URI of the zip of Python files to use (via the --py-files argument) and the Python file name as the value of the PY_FILE argument. This does not seem to work. Should I provide some relative path as the value of PY_FILE? PY_FILE is also included in the zip. For example, in

gcloud beta dataproc jobs submit pyspark  --cluster clustername --py-files gcsuriofzip PY_FILE    

What should be the value of PY_FILE?

1 answer

That's a good question. To answer it, I am going to use the PySpark wordcount example.

In this case, I created two files: test.py, the main driver file I want to execute, and wordcount.py.zip, a zip containing a modified wordcount.py that acts as the module I want to import.
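
In case it is useful, the zip itself can be produced with a plain "zip wordcount.py.zip wordcount.py". A minimal Python sketch doing the same thing, assuming wordcount.py sits in the current working directory:

import zipfile

# Put wordcount.py at the root of wordcount.py.zip so that
# `import wordcount` works once the zip is shipped via --py-files.
with zipfile.ZipFile("wordcount.py.zip", "w") as zf:
    zf.write("wordcount.py")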

Here is what test.py looks like:

import wordcount
import sys
if __name__ == "__main__":
    wordcount.wctest(sys.argv[1])

And I modified wordcount.py so that, instead of a main entry point, it exposes a named function:

...
from pyspark import SparkContext

...
def wctest(path):
    sc = SparkContext(appName="PythonWordCount")
...
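
For reference, here is a fuller sketch of what wordcount.py might look like after that change. The body of wctest is an assumption filled in from the standard PySpark wordcount example, not the author's exact file:

from operator import add

from pyspark import SparkContext

def wctest(path):
    # Count word occurrences in the text file(s) at `path` and print them.
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(path)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print("%s: %i" % (word, count))
    sc.stop()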

I can then run the whole thing on Dataproc with the following gcloud command:

gcloud beta dataproc jobs submit pyspark  --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \ 
gs://<bucket>/input/input.txt

Here <bucket> is the name (or path) of my bucket and <cluster-name> is the name of my Dataproc cluster.
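
So, relating this back to the question: the value of PY_FILE is the GCS URI of the main driver script (gs://<bucket>/test.py here), not a path relative to the zip. The module code the driver imports goes into the zip passed via --py-files, and any remaining arguments (gs://<bucket>/input/input.txt) are forwarded to the driver script.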


