Reading files from HDFS (Hadoop Distributed File System) directories into a pandas DataFrame

I am generating some delimited files from Hive queries into multiple HDFS directories. As a next step, I would like to read the files into a single pandas DataFrame in order to apply standard non-distributed algorithms.

At some level, a workable solution is trivial: use "hadoop dfs -copyToLocal" followed by local file system operations. However, I am looking for a particularly elegant way to load the data that I can incorporate into my standard practice.
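For reference, here is a minimal sketch of that baseline, assuming tab-delimited files with no header row; the HDFS directory paths are placeholders:

```python
import glob
import os
import shutil
import subprocess
import tempfile

import pandas as pd

hdfs_dirs = ["/user/me/output/dir1", "/user/me/output/dir2"]  # placeholders

local_dir = tempfile.mkdtemp()
try:
    for hdfs_dir in hdfs_dirs:
        # One system call per HDFS directory: copy its files to local disk.
        subprocess.check_call(
            ["hadoop", "dfs", "-copyToLocal", hdfs_dir, local_dir])
    # copyToLocal recreates each directory under local_dir, so match one
    # level down.
    paths = glob.glob(os.path.join(local_dir, "*", "*"))
    df = pd.concat(
        (pd.read_csv(p, sep="\t", header=None) for p in paths),
        ignore_index=True)
finally:
    shutil.rmtree(local_dir)  # the cleanup step I would rather avoid
```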

Some characteristics of an ideal solution:

  • No need to create a local copy (who likes to clean up?)
  • Minimum number of system calls
  • A few lines of Python code
1 answer

It seems that the pydoop.hdfs module solves this problem while meeting a good set of the stated goals:

http://pydoop.sourceforge.net/docs/tutorial/hdfs_api.html

I have not been able to evaluate this myself, since pydoop has very strict compilation requirements and my version of Hadoop is a bit dated.
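For reference, a minimal sketch of what the pydoop.hdfs approach might look like (untested, per the caveat above; the directory paths, delimiter, and lack of header row are assumptions):

```python
import pandas as pd
import pydoop.hdfs as hdfs

def read_hdfs_csvs(dir_paths, sep="\t"):
    """Concatenate all delimited files under the given HDFS directories."""
    frames = []
    for dir_path in dir_paths:
        # Note: hdfs.ls can also return subdirectories or marker files such
        # as _SUCCESS, which may need filtering in practice.
        for file_path in hdfs.ls(dir_path):
            # hdfs.open returns a file-like object that pandas can read
            # directly, so nothing is written to local disk.
            with hdfs.open(file_path) as f:
                frames.append(pd.read_csv(f, sep=sep, header=None))
    return pd.concat(frames, ignore_index=True)

df = read_hdfs_csvs(["/user/me/output/dir1", "/user/me/output/dir2"])
```

Since the files are streamed straight from HDFS, no local copy is ever created, and the whole thing stays within a few lines of Python.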
