Submit the PySpark job to the cluster with the '--py-files' argument

I tried to submit the job by passing the GCS URI of the zip of Python files to use (via the --py-files argument) and the Python file name as the value of the PY_FILE argument. This does not seem to work. Should I provide some relative path as the value of PY_FILE? PY_FILE is also included in the zip. For example, in

gcloud beta dataproc jobs submit pyspark  --cluster clustername --py-files gcsuriofzip PY_FILE    

What should be the value of PY_FILE?

1 answer

That's a good question. To answer it, I am going to use the PySpark wordcount example.

In this case, I created two files: test.py, the main driver file I want to execute, and wordcount.py.zip, a zip containing a modified wordcount.py that acts as the module I want to import.
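
In case it is useful, the zip itself can be produced with a plain "zip wordcount.py.zip wordcount.py". A minimal Python sketch doing the same thing, assuming wordcount.py sits in the current working directory:

import zipfile

# Put wordcount.py at the root of wordcount.py.zip so that
# `import wordcount` works once the zip is shipped via --py-files.
with zipfile.ZipFile("wordcount.py.zip", "w") as zf:
    zf.write("wordcount.py")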

Here is what test.py looks like:

import wordcount
import sys
if __name__ == "__main__":
    wordcount.wctest(sys.argv[1])

And I modified wordcount.py so that, instead of a main entry point, it exposes a named function:

...
from pyspark import SparkContext

...
def wctest(path):
    sc = SparkContext(appName="PythonWordCount")
...
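
For reference, here is a fuller sketch of what wordcount.py might look like after that change. The body of wctest is an assumption filled in from the standard PySpark wordcount example, not the author's exact file:

from operator import add

from pyspark import SparkContext

def wctest(path):
    # Count word occurrences in the text file(s) at `path` and print them.
    sc = SparkContext(appName="PythonWordCount")
    lines = sc.textFile(path)
    counts = (lines.flatMap(lambda line: line.split())
                   .map(lambda word: (word, 1))
                   .reduceByKey(add))
    for word, count in counts.collect():
        print("%s: %i" % (word, count))
    sc.stop()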

I can then run the whole thing on Dataproc with the following gcloud command:

gcloud beta dataproc jobs submit pyspark  --cluster <cluster-name> \
--py-files gs://<bucket>/wordcount.py.zip gs://<bucket>/test.py \ 
gs://<bucket>/input/input.txt

Here <bucket> is the name (or path) of my bucket and <cluster-name> is the name of my Dataproc cluster.
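
So, relating this back to the question: the value of PY_FILE is the GCS URI of the main driver script (gs://<bucket>/test.py here), not a path relative to the zip. The module code the driver imports goes into the zip passed via --py-files, and any remaining arguments (gs://<bucket>/input/input.txt) are forwarded to the driver script.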


