Yes, Shagun is right.
Basically, when you submit a Spark job to the cluster, Spark does not serialize and ship the file that you want each worker to process. You have to make it available to the workers yourself.
Typically, you put the file on a shared file system such as HDFS, S3 (Amazon), or any other distributed file system that all workers can access. Once you do this and reference the file's location in your Spark script, the file will be available to the Spark job for reading and processing as you wish.
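Here is a minimal PySpark sketch of this approach. It assumes the file has already been uploaded to HDFS (or S3) at a path every worker can reach; the host, port, bucket, and file names are placeholders, and the S3 variant additionally assumes the hadoop-aws connector and credentials are configured.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-shared-file").getOrCreate()

# Read from HDFS -- every executor pulls its partitions from the same URI.
# (hdfs://namenode:8020/data/input.txt is a placeholder path.)
lines = spark.sparkContext.textFile("hdfs://namenode:8020/data/input.txt")

# Or from S3, assuming the hadoop-aws connector and credentials are set up:
# lines = spark.sparkContext.textFile("s3a://my-bucket/data/input.txt")

print(lines.count())
spark.stop()
```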
That said, copying the file to the same path on the driver and every worker also works. For example, you can create a folder like /opt/spark-job/all-files/ on all Spark nodes, rsync the file to each of them, and then use that local path in your Spark script. Please do not do this, though; a DFS or S3 is better than this approach.
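For completeness, a sketch of that "copy to every node" approach, assuming the file already exists at the same local path on the driver and on every worker (for example, distributed beforehand with rsync):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("read-local-copy").getOrCreate()

# The file:// scheme makes each executor read from its own local filesystem,
# so the file must be present at this exact path on every node.
lines = spark.sparkContext.textFile("file:///opt/spark-job/all-files/input.txt")

print(lines.count())
spark.stop()
```

If the file is missing or out of date on even one node, tasks scheduled there will fail, which is exactly why the shared-filesystem approach above is preferred.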