The --files option in pyspark does not work

I tried both the sc.addFile approach (which works without any problems) and the --files option from the command line (which fails).

Launch 1: spark_distro.py

    from pyspark import SparkContext, SparkConf
    from pyspark import SparkFiles

    def import_my_special_package(x):
        from external_package import external
        ext = external()
        return ext.fun(x)

    conf = SparkConf().setAppName("Using External Library")
    sc = SparkContext(conf=conf)

    sc.addFile("/local-path/readme.txt")
    with open(SparkFiles.get('readme.txt')) as test_file:
        lines = [line.strip() for line in test_file]
    print(lines)

    int_rdd = sc.parallelize([1, 2, 4, 3])
    mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1)
                            .map(lambda x: import_my_special_package(x)))

external package: external_package.py

    class external(object):
        def __init__(self):
            pass

        def fun(self, input):
            return input * 2

readme.txt

 MY TEXT HERE 

spark-submit command

    spark-submit \
      --master yarn-client \
      --py-files /path to local codelib/external_package.py \
      /local-pgm-path/spark_distro.py \
      1000

Result: works as expected

 ['MY TEXT HERE'] 

But if I try to ship the file (readme.txt) from the command line using the --files option (instead of sc.addFile), it fails, as shown below.

Launch 2: spark_distro.py

    from pyspark import SparkContext, SparkConf
    from pyspark import SparkFiles

    def import_my_special_package(x):
        from external_package import external
        ext = external()
        return ext.fun(x)

    conf = SparkConf().setAppName("Using External Library")
    sc = SparkContext(conf=conf)

    with open(SparkFiles.get('readme.txt')) as test_file:
        lines = [line.strip() for line in test_file]
    print(lines)

    int_rdd = sc.parallelize([1, 2, 4, 3])
    mod_rdd = sorted(int_rdd.filter(lambda z: z % 2 == 1)
                            .map(lambda x: import_my_special_package(x)))

external_package.py: same as above

spark-submit command

    spark-submit \
      --master yarn-client \
      --py-files /path to local codelib/external_package.py \
      --files /local-path/readme.txt#readme.txt \
      /local-pgm-path/spark_distro.py \
      1000

Output:

    Traceback (most recent call last):
      File "/local-pgm-path/spark_distro.py", line 31, in <module>
        with open(SparkFiles.get('readme.txt')) as test_file:
    IOError: [Errno 2] No such file or directory: u'/tmp/spark-42dff0d7-c52f-46a8-8323-08bccb412cd6/userFiles-8bd16297-1291-4a37-b080-bbc3836cb512/readme.txt'

Aren't sc.addFile and --files meant for the same purpose? Can someone share their thoughts?

+5
4 answers

I finally figured out the problem, and it is really very subtle.

As suspected, the two options (sc.addFile and --files) are not equivalent, and this (admittedly very subtle) difference is stated in the documentation:

addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.

--files FILES
Comma-separated list of files to be placed in the working directory of each executor.

In plain English: while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as the OP does), we get a No such file or directory error.

Let's confirm this (stripping out all the irrelevant parts of the OP's code, i.e. --py-files and 1000):

test_fail.py

    from pyspark import SparkContext, SparkConf
    from pyspark import SparkFiles

    conf = SparkConf().setAppName("Use External File")
    sc = SparkContext(conf=conf)

    with open(SparkFiles.get('readme.txt')) as test_file:
        lines = [line.strip() for line in test_file]
    print(lines)

Test:

    spark-submit --master yarn \
      --deploy-mode client \
      --files /home/ctsats/readme.txt \
      /home/ctsats/scripts/SO/test_fail.py

Result:

    [...]
    17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
    [...]
    Traceback (most recent call last):
      File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
        with open(SparkFiles.get('readme.txt')) as test_file:
    IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'

In the test_fail.py script above, it is the driver program that requests access to readme.txt; let's change the script so that access is requested by the executors instead (test_success.py):

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("Use External File")
    sc = SparkContext(conf=conf)

    lines = sc.textFile("readme.txt")  # runs in the executors
    print(lines.collect())

Test:

    spark-submit --master yarn \
      --deploy-mode client \
      --files /home/ctsats/readme.txt \
      /home/ctsats/scripts/SO/test_success.py

Result:

    [...]
    17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
    [...]
    [u'MY TEXT HERE']

Please note that here we do not need SparkFiles.get - the file is easily accessible.

As mentioned above, sc.addFile works in both cases, i.e. whether access is requested by the driver or by the executors (tested, but not shown here).
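For illustration, here is a minimal sketch of what such an sc.addFile test could look like; this is my reconstruction (reusing the same illustrative local path), not the exact code that was tested:

    from pyspark import SparkContext, SparkConf
    from pyspark import SparkFiles

    conf = SparkConf().setAppName("Use External File")
    sc = SparkContext(conf=conf)

    # distributed to the driver and to every executor node
    sc.addFile("/home/ctsats/readme.txt")

    # driver-side access works (this is exactly what fails with --files)
    with open(SparkFiles.get('readme.txt')) as test_file:
        print([line.strip() for line in test_file])

    def read_on_executor(_):
        # SparkFiles.get, called inside a task, resolves to the
        # executor's local copy of the file shipped via sc.addFile
        with open(SparkFiles.get('readme.txt')) as f:
            return f.read().strip()

    # executor-side access works as well
    print(sc.parallelize([1, 2, 3]).map(read_on_executor).collect())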

Regarding the order of the command-line arguments: as I have argued elsewhere, all Spark-related arguments must come before the script to be executed; the relative order of --files and --py-files most likely doesn't matter (left as an exercise).

Tested with Spark 1.6.0 and 2.2.0.

UPDATE (after the comments): Indeed, my fs.defaultFS setting points to HDFS:

    $ hdfs getconf -confKey fs.defaultFS
    hdfs://host-hd-01.corp.nodalpoint.com:8020

But let me focus on the forest here rather than the trees, and explain why this whole discussion is of academic interest only:

Passing files to be processed with the --files flag is bad practice; in retrospect, I can now see why I could find almost no references on it online - probably nobody uses it in practice, and with good reason.

(Note that I'm not talking about --py-files, which has a different, legitimate role.)

Since Spark is a distributed processing framework running over a cluster and a distributed file system (HDFS), the best thing to do is to have all the files to be processed already in HDFS. The "natural" place for files processed by Spark is HDFS, not the local FS, even though there are toy examples that use the local FS for demonstration purposes only. What's more, if you ever want to switch the deploy mode to cluster in the future, you will discover that the cluster, by default, knows nothing about local paths and files, and rightly so...
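For illustration, a minimal sketch of that recommended setup; the HDFS path below is hypothetical and assumes the file has already been uploaded, e.g. with hdfs dfs -put:

    # upload the file to HDFS first, e.g.:
    #   hdfs dfs -put /home/ctsats/readme.txt /user/ctsats/readme.txt
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("Use File From HDFS")
    sc = SparkContext(conf=conf)

    # no --files, no sc.addFile: the executors read straight from HDFS
    lines = sc.textFile("hdfs:///user/ctsats/readme.txt")
    print(lines.collect())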

+2

Same purpose, but different uses.

With the --files option

  --files FILES Comma-separated list of files to be placed in the working directory of each executor. 

you do not need SparkFiles.get to find the file; it is already in the working directory of each executor:

    with open('readme.txt') as test_file:
        lines = [line.strip() for line in test_file]
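As explained in the answer above, this only holds for code running on the executors (in client deploy mode the driver's working directory does not receive the file), so a minimal sketch that actually exercises it from inside a task could look like this:

    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("Use External File")
    sc = SparkContext(conf=conf)

    def first_line(_):
        # runs on an executor, whose working directory contains readme.txt
        # (shipped there by --files /local-path/readme.txt)
        with open('readme.txt') as test_file:
            return test_file.readline().strip()

    print(sc.parallelize([1]).map(first_line).collect())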
0

Try:

    import os
    from os import path

    here = os.getcwd()
    with open(path.join(here, 'readme.txt')) as f:
        pass

Read Submitting Applications; it says:

For Python applications, simply pass a .py file in the place of a JAR, and add Python .zip, .egg or .py files to the search path with --py-files.

The search path is the key.
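To illustrate the point, a minimal sketch built on the OP's own external_package.py (the app.py name in the comment is just a placeholder):

    # shipped with:  spark-submit --py-files external_package.py app.py
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("py-files demo")
    sc = SparkContext(conf=conf)

    def doubled(x):
        # the import succeeds because --py-files added external_package.py
        # to the Python search path on every executor
        from external_package import external
        return external().fun(x)

    print(sc.parallelize([1, 3]).map(doubled).collect())  # [2, 6]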

-1

Pass the file path as an argument to spark-submit and read that argument in the program, e.g.

    sc = SparkContext(conf=conf)
    with open(SparkFiles.get(sys.argv[1])) as test_file:

and in the submit command:

    spark-submit \
      --master yarn-client \
      /local-pgm-path/spark_distro.py \
      /local/file/path

The file path should come after the application script (which takes the place of the application JAR).
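A runnable sketch of this idea, under the assumption that the driver runs where the file lives (client deploy mode) and that the path is opened directly rather than through SparkFiles:

    import sys
    from pyspark import SparkContext, SparkConf

    conf = SparkConf().setAppName("File Path As Argument")
    sc = SparkContext(conf=conf)

    # sys.argv[1] is /local/file/path from the spark-submit command above;
    # in client mode the driver runs locally, so a plain open() works
    with open(sys.argv[1]) as test_file:
        lines = [line.strip() for line in test_file]
    print(lines)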

-1

Source: https://habr.com/ru/post/1273219/

