Best way to process a file from HDFS one line at a time in CPython (without using stdin)?

I would like to use CPython in a Hadoop Streaming job that needs access to supplementary information from a line-oriented file stored in the Hadoop file system. By "supplementary" I mean that this file is in addition to the data delivered via stdin. The supplementary file is too large to simply load into memory and split on newline characters. Is there a particularly elegant way (or library) to process this file one line at a time?

Thanks,

Setjmp

+4
2 answers

Check the Hadoop Streaming documentation on using the distributed cache. First you upload the file to HDFS, then you tell Hadoop to replicate it to every node before the job starts, and it conveniently places a symlink in each task's working directory. Then you can simply use Python's open() to read the file with for line in f, or whatever you need.

The distributed cache is the most efficient (out of the box) way to ship files for a job to use as a resource. You do not want to simply open the HDFS file from your process, because then every task would pull the file over the network... With the distributed cache, a single copy is downloaded even if several tasks run on the same node.


First add -files hdfs://NN:9000/user/sup.txt#sup.txt to the command line arguments when starting the job.

Then:

    for line in open('sup.txt'):
        pass  # do stuff
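
For a bit more context, here is a minimal sketch of a complete streaming mapper that uses the cached file. The names mapper.py and sup.txt, the tab-separated layout, and the key-matching logic are illustrative assumptions, not part of the original answer:

    #!/usr/bin/env python
    # mapper.py -- sketch of a streaming mapper that reads the cached file.
    # sup.txt must match the "#sup.txt" fragment of the -files argument.
    import sys

    def supplementary_lines():
        """Yield the cached file one line at a time.

        The distributed cache symlinks sup.txt into the task's working
        directory, so a plain relative open() is enough; iterating the
        file object never holds more than one line in memory.
        """
        with open('sup.txt') as f:
            for line in f:
                yield line.rstrip('\n')

    if __name__ == '__main__':
        # Illustrative combine step: emit every supplementary line whose
        # first tab-separated field was seen as a key on stdin.
        keys = set(line.strip() for line in sys.stdin)
        for sup_line in supplementary_lines():
            if sup_line.split('\t', 1)[0] in keys:
                print(sup_line)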
+3

Are you looking for this?

http://pydoop.sourceforge.net/docs/api_docs/hdfs_api.html#module-pydoop.hdfs

    import pydoop.hdfs

    with pydoop.hdfs.open("supplementary", "r") as supplementary:
        for line in supplementary:
            pass  # process line
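
A slightly fuller, self-contained sketch of the same idea (with the caveat from the other answer that opening HDFS directly makes every task pull the file over the network). The path and the per-line handling below are placeholders:

    #!/usr/bin/env python
    # Stream a line-oriented HDFS file with pydoop, one line at a time.
    import pydoop.hdfs as hdfs

    # Placeholder path -- substitute the real NameNode and file location.
    SUPPLEMENTARY_PATH = 'hdfs://NN:9000/user/supplementary.txt'

    def process(line):
        # Job-specific per-line work goes here.
        print(line)

    with hdfs.open(SUPPLEMENTARY_PATH, 'r') as supplementary:
        # The file object is iterable, so only one line is held in
        # memory at a time.
        for line in supplementary:
            process(line.rstrip('\n'))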
+1

Source: https://habr.com/ru/post/1386978/
