While submitting a job using pyspark, how do I access static files uploaded with the --files argument?

For example, I have a folder:

 /
 - test.py
 - test.yml

and the task is sent to the spark cluster using:

gcloud beta dataproc jobs submit pyspark --files=test.yml "test.py"

In test.py, I want to access the static file that was uploaded:

 with open('test.yml') as test_file:
     logging.info(test_file.read())

but received the following exception:

 IOError: [Errno 2] No such file or directory: 'test.yml' 

How do I access the uploaded file?

2 answers

Files distributed using SparkContext.addFile (and --files) can be accessed via SparkFiles. It provides two methods:

  • getRootDirectory() - returns the root directory for distributed files
  • get(filename) - returns the absolute path to the file

I'm not sure if there are any Dataproc limitations, but something like this should work fine:

 from pyspark import SparkFiles

 with open(SparkFiles.get('test.yml')) as test_file:
     logging.info(test_file.read())
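
For reference, a minimal sketch of the second method, assuming a SparkContext is already active: getRootDirectory() returns the local directory into which --files content was copied, so you can also build the path yourself.

 import logging
 import os

 from pyspark import SparkFiles

 # Root of the local directory where files shipped with --files / addFile land
 root = SparkFiles.getRootDirectory()
 with open(os.path.join(root, 'test.yml')) as test_file:
     logging.info(test_file.read())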

Yes, Shagun is right.

Basically, when you submit a Spark job to the cluster, it does not serialize the file that you want each worker to process. You have to distribute it yourself.

Typically, you put the file on a shared file system such as HDFS, S3 (Amazon), or any other DFS that all workers can access. Once you do that and reference the file's location in your Spark script, the file will be available to the job for reading and processing as you wish.
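
For example, if the file has been staged on a shared store beforehand (the gs://my-bucket/config/test.yml path below is just a hypothetical placeholder), a sketch like this reads it from inside the job:

 # Minimal sketch: wholeTextFiles yields (path, content) pairs, and
 # collect() brings the single entry back to the driver.
 import logging

 from pyspark import SparkContext

 sc = SparkContext.getOrCreate()
 path, content = sc.wholeTextFiles('gs://my-bucket/config/test.yml').collect()[0]
 logging.info(content)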

However, having said that, copying the file to the same destination in the file system of every worker and master node also works. For example, you can create a folder like /opt/spark-job/all-files/ on all Spark nodes, rsync the file to all of them, and then reference that path in your Spark script. But please do not do this; a DFS or S3 is better than this approach.
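
If you do take that route anyway, each worker simply opens the replicated path directly. A rough sketch, assuming sc is the active SparkContext and the file was rsynced to the folder above:

 # Minimal sketch: the file sits at the same absolute path on every node,
 # so each partition opens it with a plain open() call.
 def read_with_config(rows):
     with open('/opt/spark-job/all-files/test.yml') as f:
         config = f.read()  # parse as needed
     for row in rows:
         yield (row, len(config))

 result = sc.parallelize(range(4)).mapPartitions(read_with_config).collect()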


Source: https://habr.com/ru/post/1241194/