See the Hadoop Streaming documentation on using the distributed cache. The workflow is: first you upload the file to HDFS, then you tell Hadoop to copy it to every node before the tasks start, and Hadoop places a symbolic link to it in each task's working directory. From there you can read it with a plain Python open() and iterate over it with for line in f.
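For the upload step, something like this does the job (the /user/sup.txt destination is just an example path, chosen to match the one used below):

hdfs dfs -put sup.txt hdfs://NN:9000/user/sup.txt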
The distributed cache is the most efficient way (out of the box) to ship a file that a job needs as a resource. You don't want each task to simply open the file on HDFS from your process, because then every task would pull it over the network ... With the distributed cache, only one copy is downloaded per node, even when several tasks run on that node.
First, add -files hdfs://NN:9000/user/sup.txt#sup.txt to the command-line arguments when starting the job. The part after the # is the name of the symlink that will appear in the task's working directory.
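For context, a full invocation might look roughly like this; the streaming jar path, the input/output paths and the mapper/reducer script names are placeholders, and the local scripts are shipped via the same -files list. Note that -files is a generic option, so it goes before the streaming-specific options:

hadoop jar $HADOOP_HOME/share/hadoop/tools/lib/hadoop-streaming-*.jar \
    -files hdfs://NN:9000/user/sup.txt#sup.txt,mapper.py,reducer.py \
    -input /user/me/input \
    -output /user/me/output \
    -mapper mapper.py \
    -reducer reducer.py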
Then, in your mapper or reducer script, open it by that name:

for line in open('sup.txt'):  # 'sup.txt' is the symlink in the task's working directory
    ...
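Putting it together, a minimal mapper sketch might look like the following; the assumption that sup.txt contains one key per line and that the input records are tab-separated is mine, not something stated above:

#!/usr/bin/env python
import sys

# Load the cached file once per task; 'sup.txt' is the symlink
# created from the #sup.txt fragment of the -files argument.
with open('sup.txt') as f:
    lookup = set(line.strip() for line in f)

# Standard streaming loop: records arrive on stdin,
# key/value pairs are emitted on stdout separated by a tab.
for line in sys.stdin:
    key = line.strip().split('\t')[0]
    if key in lookup:
        print('%s\t%s' % (key, '1'))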