How does Hadoop receive input not saved on HDFS?

I am trying to wrap my brain around Hadoop and have read this excellent tutorial, as well as the official Hadoop docs. However, nowhere in this literature can I find a simple explanation for something rather rudimentary:

In all the contrived "Hello world!" (word count) MR introductory examples, the input data is stored directly in text files. However, it seems to me that this rarely happens in the real world. I would guess that in practice the input data lives in large data stores such as a relational database, Mongo, or Cassandra, or is only accessible through a REST API, etc.

So I ask: in the real world, how does Hadoop get its input? I see that there are projects like Sqoop and Flume, and I'm wondering whether the whole point of these frameworks is simply to ETL input data into HDFS for MR jobs.
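For concreteness, here is a minimal sketch of what such an "ETL into HDFS" step could look like at the API level, using Hadoop's FileSystem API. The HTTP endpoint and HDFS paths are made up for illustration; real pipelines would normally use a tool like Sqoop or Flume instead of hand-written code like this.

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

/** Illustrative only: pull records from an external HTTP endpoint and land them on HDFS. */
public class RestToHdfs {
    public static void main(String[] args) throws Exception {
        // fs.defaultFS is taken from core-site.xml on the classpath.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        URL source = new URL("https://example.com/api/records"); // hypothetical endpoint
        Path target = new Path("/data/landing/records.txt");     // hypothetical HDFS path

        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(source.openStream(), StandardCharsets.UTF_8));
             OutputStream out = fs.create(target, true)) {       // overwrite if present
            String line;
            while ((line = in.readLine()) != null) {
                // Write each fetched record as one line, so TextInputFormat can read it later.
                out.write((line + "\n").getBytes(StandardCharsets.UTF_8));
            }
        }
    }
}
```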

1 answer

HDFS is in fact needed in real-world applications, for several reasons:

  • Very high throughput to support MapReduce workloads, and scalability.
  • Data reliability and fault tolerance, thanks to replication and its distributed nature; this is essential for critical data systems.
  • Flexibility: you do not need to pre-process the data before storing it in HDFS.

Hadoop is designed around the write-once, read-many concept. Kafka, Flume, and Sqoop, which are commonly used for ingestion, are themselves very fault tolerant and provide high throughput for delivering data into HDFS. Sometimes you need to collect data from thousands of sources per minute, amounting to gigabytes of data; for that you need these tools, as well as fault-tolerant HDFS storage.
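Once the ingestion tool has landed the data on HDFS, an MR job consumes it like any other file input. Below is a sketch of a driver for such a job; the identity Mapper and the paths (`/data/landing`, `/data/landing-copy`) are assumptions chosen to keep the example minimal, not part of the original answer.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

/** Sketch: an MR job whose input is the HDFS directory a Flume/Sqoop pipeline writes into. */
public class LandingZoneJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "read-landing-zone");
        job.setJarByClass(LandingZoneJob.class);

        // Identity mapper, no reducers: the job simply re-emits whatever the
        // ingestion tool wrote, showing that MR sees it as ordinary HDFS input.
        job.setMapperClass(Mapper.class);
        job.setNumReduceTasks(0);
        job.setInputFormatClass(TextInputFormat.class);
        job.setOutputFormatClass(TextOutputFormat.class);
        job.setOutputKeyClass(LongWritable.class);
        job.setOutputValueClass(Text.class);

        // Hypothetical paths: where the ingestion tool lands data, and where
        // this job should write its results.
        FileInputFormat.addInputPath(job, new Path("/data/landing"));
        FileOutputFormat.setOutputPath(job, new Path("/data/landing-copy"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

In a real job you would replace the identity Mapper with your own Mapper/Reducer classes; the point is only that the ingestion tool's target directory becomes the job's input path.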


Source: https://habr.com/ru/post/989724/
