Best way to process a file from HDFS one line at a time in CPython (without using stdin)?

I would like to use CPython in a Hadoop Streaming job that needs access to supplementary information from a line-oriented file stored in the Hadoop file system. By "supplementary" I mean that this file is in addition to the data delivered via stdin. The supplementary file is too large to simply load into memory and split on newline characters. Is there a particularly elegant way (or library) to process this file one line at a time?

Thanks,

Setjmp

+4
2 answers

Check the Hadoop Streaming documentation on using the distributed cache. First you upload the file to HDFS, then you tell Hadoop to replicate it to every node before the job starts, and it conveniently places a symlink in each task's working directory. Then you can simply use Python's open() to read the file with for line in f, or whatever you need.

The distributed cache is the most efficient (out of the box) way to ship files for a job to use as a resource. You do not want to simply open the HDFS file from your process, because then every task would pull the file over the network... With the distributed cache, a single copy is downloaded even if several tasks run on the same node.


First add -files hdfs://NN:9000/user/sup.txt#sup.txt to the command line arguments when starting the job.

Then:

    for line in open('sup.txt'):
        pass  # do stuff
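
For a bit more context, here is a minimal sketch of a complete streaming mapper that uses the cached file. The names mapper.py and sup.txt, the tab-separated layout, and the key-matching logic are illustrative assumptions, not part of the original answer:

    #!/usr/bin/env python
    # mapper.py -- sketch of a streaming mapper that reads the cached file.
    # sup.txt must match the "#sup.txt" fragment of the -files argument.
    import sys

    def supplementary_lines():
        """Yield the cached file one line at a time.

        The distributed cache symlinks sup.txt into the task's working
        directory, so a plain relative open() is enough; iterating the
        file object never holds more than one line in memory.
        """
        with open('sup.txt') as f:
            for line in f:
                yield line.rstrip('\n')

    if __name__ == '__main__':
        # Illustrative combine step: emit every supplementary line whose
        # first tab-separated field was seen as a key on stdin.
        keys = set(line.strip() for line in sys.stdin)
        for sup_line in supplementary_lines():
            if sup_line.split('\t', 1)[0] in keys:
                print(sup_line)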
+3

Are you looking for this?

http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#module-pydoop.hdfs

    import pydoop.hdfs

    with pydoop.hdfs.open("supplementary", "r") as supplementary:
        for line in supplementary:
            pass  # process line
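
A slightly fuller, self-contained sketch of the same idea (with the caveat from the other answer that opening HDFS directly makes every task pull the file over the network). The path and the per-line handling below are placeholders:

    #!/usr/bin/env python
    # Stream a line-oriented HDFS file with pydoop, one line at a time.
    import pydoop.hdfs as hdfs

    # Placeholder path -- substitute the real NameNode and file location.
    SUPPLEMENTARY_PATH = 'hdfs://NN:9000/user/supplementary.txt'

    def process(line):
        # Job-specific per-line work goes here.
        print(line)

    with hdfs.open(SUPPLEMENTARY_PATH, 'r') as supplementary:
        # The file object is iterable, so only one line is held in
        # memory at a time.
        for line in supplementary:
            process(line.rstrip('\n'))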
+1

Source: https://habr.com/ru/post/1386978/
