I am generating some delimited files from Hive queries into multiple HDFS directories. As a next step, I would like to read the files into a single pandas DataFrame in order to apply standard non-distributed algorithms.
At some level, a workable solution is trivial: run `hadoop fs -copyToLocal` and then operate on the local file system. However, I am looking for a particularly elegant way to load the data, one that I can make part of my standard workflow.
Some characteristics of an ideal solution:
- No need to create a local copy (who likes to clean up?)
- Minimum number of system calls
- A few lines of Python code
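One sketch that fits these constraints is to stream each HDFS file into pandas and concatenate the results, with no local copy. The concatenation logic below is plain pandas; the commented HDFS portion is an assumption that pyarrow is available and that an HDFS namenode is reachable (the host, port, and path are hypothetical placeholders):

```python
import pandas as pd


def frames_from_files(open_files, sep="\t", columns=None):
    """Read each delimited file-like object and concatenate into one DataFrame.

    Works with any iterable of file-like objects, so the same code serves
    local files, StringIO buffers, or HDFS input streams.
    """
    parts = [pd.read_csv(f, sep=sep, header=None, names=columns)
             for f in open_files]
    return pd.concat(parts, ignore_index=True)


# Against a live cluster, pyarrow can open the HDFS files directly
# (assumed setup; adjust host/port/path to your installation):
#
# from pyarrow import fs
# hdfs = fs.HadoopFileSystem("namenode-host", 8020)
# infos = hdfs.get_file_info(fs.FileSelector("/my/output/dir", recursive=True))
# df = frames_from_files(
#     (hdfs.open_input_stream(i.path) for i in infos if i.is_file),
#     columns=["col1", "col2"],
# )
```

Because the helper only expects file-like objects, it makes no system calls itself and leaves nothing to clean up; the HDFS client is the only moving part you swap in.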