I finally figured out the problem, and it is indeed a very subtle one.
As expected, the two options (sc.addFile and --files) are not equivalent, and this is (admittedly very subtly) hinted at in the documentation (emphasis mine):
addFile(path, recursive=False)
Add a file to be downloaded with this Spark job on every node.
--files FILES
Comma-separated list of files to be placed in the working directory of each executor.
In plain English: while files added with sc.addFile are available to both the executors and the driver, files added with --files are available only to the executors; hence, when trying to access them from the driver (as is the case in the OP), we get a No such file or directory error.
Let's confirm this (getting rid of the irrelevant --py-files and 1000 stuff from the OP):
test_fail.py
from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)
Test:
spark-submit --master yarn \
  --deploy-mode client \
  --files /home/ctsats/readme.txt \
  /home/ctsats/scripts/SO/test_fail.py
Result:
[...]
17/11/10 15:05:39 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0047/readme.txt
[...]
Traceback (most recent call last):
  File "/home/ctsats/scripts/SO/test_fail.py", line 6, in <module>
    with open(SparkFiles.get('readme.txt')) as test_file:
IOError: [Errno 2] No such file or directory: u'/tmp/spark-8715b4d9-a23b-4002-a1f0-63a1e9d3e00e/userFiles-60053a41-472e-4844-a587-6d10ed769e1a/readme.txt'
In the test_fail.py script above, it is the driver program that requests access to readme.txt; let's change the script so that access is requested by the executors instead (test_success.py):
from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

lines = sc.textFile("readme.txt")
print(lines.collect())  # produces the [u'MY TEXT HERE'] output shown below
Test:
spark-submit --master yarn \
  --deploy-mode client \
  --files /home/ctsats/readme.txt \
  /home/ctsats/scripts/SO/test_success.py
Result:
[...]
17/11/10 15:16:05 INFO yarn.Client: Uploading resource file:/home/ctsats/readme.txt -> hdfs://host-hd-01.corp.nodalpoint.com:8020/user/ctsats/.sparkStaging/application_1507295423401_0049/readme.txt
[...]
[u'MY TEXT HERE']
Note also that here we don't need SparkFiles.get - the file is readily accessible.
As said above, sc.addFile will work in both cases, i.e. whether access is requested by the driver or by the executors (tested, but not shown here).
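Such a script would look roughly like the following sketch - this is only an illustration of the idea (reusing the readme.txt path from the examples above), not the exact script used in those tests:

from pyspark import SparkContext, SparkConf
from pyspark import SparkFiles

conf = SparkConf().setAppName("Use External File")
sc = SparkContext(conf=conf)

# distribute the file programmatically instead of via --files;
# the local path is the one used in the examples above - adjust to your setup
sc.addFile("/home/ctsats/readme.txt")

# unlike with --files, the driver can now resolve the distributed copy
with open(SparkFiles.get('readme.txt')) as test_file:
    lines = [line.strip() for line in test_file]
print(lines)

No --files flag is needed in the spark-submit command in this case, since the file is distributed by the script itself.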
Regarding the order of the command-line arguments: as I have argued elsewhere, all Spark-related arguments must come before the script to be executed; arguably, the relative order of --files and --py-files is irrelevant (leaving it as an exercise).
Tested with Spark 1.6.0 and 2.2.0.
UPDATE (after the comments): It seems that my fs.defaultFS setting also points to HDFS:
$ hdfs getconf -confKey fs.defaultFS
hdfs://host-hd-01.corp.nodalpoint.com:8020
But let me focus on the forest here (instead of the trees), and explain why this whole discussion is of academic interest only:
Passing files to be processed with the --files flag is bad practice; in hindsight, I can now see why I could find almost no references to it online - probably nobody uses it in practice, and with good reason.
(Note that I'm not talking about --py-files, which has a different, legitimate role.)
Since Spark is a distributed processing framework, running over a cluster and a distributed file system (HDFS), the best thing to do is to have all the files to be processed already in HDFS. The "natural" place for files to be processed by Spark is HDFS, not the local FS, although there are toy examples that use the local FS for demonstration purposes only. What's more, if you ever want to change the deploy mode to cluster, you will discover that, by default, the cluster knows nothing about local paths and files, and rightfully so...
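For illustration, here is a minimal sketch of that recommended setup; it assumes the file has first been uploaded to HDFS (e.g. with hdfs dfs -put), and the HDFS path shown is hypothetical:

from pyspark import SparkContext, SparkConf

conf = SparkConf().setAppName("Read From HDFS")
sc = SparkContext(conf=conf)

# hypothetical HDFS location - assumes the file was uploaded beforehand,
# e.g. with: hdfs dfs -put /home/ctsats/readme.txt /user/ctsats/readme.txt
lines = sc.textFile("hdfs:///user/ctsats/readme.txt")
print(lines.collect())

Since no local paths are involved, a script like this works the same way in both client and cluster deploy modes, and needs neither --files nor sc.addFile.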