Reading files from HDFS (Hadoop Distributed File System) directories into a pandas DataFrame

I am generating some delimited files from Hive queries into multiple HDFS directories. As a next step, I would like to read the files into a single pandas DataFrame in order to apply standard non-distributed algorithms.

At some level, a workable solution is trivial: use "hadoop dfs -copyToLocal" followed by local file system operations. However, I am looking for a particularly elegant way to load the data that I can incorporate into my standard practice.
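For reference, here is a minimal sketch of that baseline, assuming tab-delimited files with no header row; the HDFS directory paths are placeholders:

```python
import glob
import os
import shutil
import subprocess
import tempfile

import pandas as pd

hdfs_dirs = ["/user/me/output/dir1", "/user/me/output/dir2"]  # placeholders

local_dir = tempfile.mkdtemp()
try:
    for hdfs_dir in hdfs_dirs:
        # One system call per HDFS directory: copy its files to local disk.
        subprocess.check_call(
            ["hadoop", "dfs", "-copyToLocal", hdfs_dir, local_dir])
    # copyToLocal recreates each directory under local_dir, so match one
    # level down.
    paths = glob.glob(os.path.join(local_dir, "*", "*"))
    df = pd.concat(
        (pd.read_csv(p, sep="\t", header=None) for p in paths),
        ignore_index=True)
finally:
    shutil.rmtree(local_dir)  # the cleanup step I would rather avoid
```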

Some characteristics of an ideal solution:

  • No need to create a local copy (who likes to clean up?)
  • Minimum number of system calls
  • A few lines of Python code
1 answer

It seems that the pydoop.hdfs module solves this problem while meeting a good set of the stated goals:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html

I have not been able to evaluate this myself, since pydoop has very strict compilation requirements and my version of Hadoop is a bit dated.
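For reference, a minimal sketch of what the pydoop.hdfs approach might look like (untested, per the caveat above; the directory paths, delimiter, and lack of header row are assumptions):

```python
import pandas as pd
import pydoop.hdfs as hdfs

def read_hdfs_csvs(dir_paths, sep="\t"):
    """Concatenate all delimited files under the given HDFS directories."""
    frames = []
    for dir_path in dir_paths:
        # Note: hdfs.ls can also return subdirectories or marker files such
        # as _SUCCESS, which may need filtering in practice.
        for file_path in hdfs.ls(dir_path):
            # hdfs.open returns a file-like object that pandas can read
            # directly, so nothing is written to local disk.
            with hdfs.open(file_path) as f:
                frames.append(pd.read_csv(f, sep=sep, header=None))
    return pd.concat(frames, ignore_index=True)

df = read_hdfs_csvs(["/user/me/output/dir1", "/user/me/output/dir2"])
```

Since the files are streamed straight from HDFS, no local copy is ever created, and the whole thing stays within a few lines of Python.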
