I'm trying to wrap my brain around Hadoop. I've read this excellent tutorial, as well as the official Hadoop docs. However, nowhere in that literature can I find a simple explanation for something rather rudimentary:
In all of the contrived "Hello world!" (word count) MapReduce introductory examples, the input data is stored directly in text files. But it seems to me that this rarely happens in the real world. I would guess that in practice the input data lives in large data stores such as a relational database, Mongo, or Cassandra, or is only accessible through a REST API, etc.
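For concreteness, here is roughly the kind of introductory example I mean (this is essentially the WordCount driver from the official Hadoop tutorial, slightly condensed). Note how the "input" is nothing more than a path to plain text files sitting on HDFS:

```java
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Emits (word, 1) for every token in a line of input text.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private final Text word = new Text();

    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, one);
      }
    }
  }

  // Sums up the counts for each word.
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    // This is the part my question is about: the input is just a
    // directory of text files on HDFS, passed in as args[0].
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```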
So I ask: in the real world, how does Hadoop get its input? I see that there are projects like Sqoop and Flume, and I'm wondering whether the whole point of these frameworks is simply to ETL input data into HDFS so that MR jobs can run over it.