What is "Hadoop" - the definition of Hadoop?

This seems obvious: we all agree that HDFS + YARN + MapReduce can be called Hadoop. But what about other combinations and other products in the Hadoop ecosystem?

Is, for example, HDFS + YARN + Spark still Hadoop? Is HBase Hadoop? I think we consider HDFS + YARN + Pig to be Hadoop, since Pig uses MapReduce.

Is only using MapReduce Hadoop, while anything else running on HDFS + YARN (like Spark) is not Hadoop?

+6
4 answers

I agree with your impression that the term “Hadoop” does not have a useful definition. “We have a Hadoop cluster” can mean different things.

There is an official answer, though, at http://hadoop.apache.org/#What+Is+Apache+Hadoop%3F :

The Apache™ Hadoop® project develops open-source software for reliable, scalable, distributed computing.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

So, "Hadoop" is the name of the project and software library. Any other use is poorly defined.

+3

In addition to the Apache Hadoop definition from the official website, I would like to emphasize that Hadoop is a framework, and there are many subsystems in the Hadoop ecosystem.

I am quoting this content from the official website so that broken links in the future do not cause any problems for this answer.

The project includes the following modules:

Hadoop Common: The common utilities that support the other Hadoop modules.

Hadoop Distributed File System (HDFS™): A distributed file system that provides high-throughput access to application data.

Hadoop YARN: A framework for job scheduling and cluster resource management.

Hadoop MapReduce: A YARN-based system for parallel processing of large data sets.

More or less,

Hadoop = Distributed Storage (HDFS) + Distributed Processing (YARN + MapReduce)
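
To make that formula concrete, here is a compact version of the classic word-count job from the official Hadoop MapReduce tutorial: HDFS holds the input and output paths, YARN schedules the containers, and MapReduce defines the computation. The input and output paths passed in args are illustrative.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Map phase: emit (word, 1) for every word in the input split.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    public void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reduce phase: sum the counts for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable val : values) {
        sum += val.get();
      }
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
    FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```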

But these four modules do not cover the whole Hadoop ecosystem. The Hadoop ecosystem has many Hadoop-related projects and 40+ subsystems.

Other Hadoop-related projects at Apache include:

Ambari™: A web-based tool for provisioning, managing, and monitoring Apache Hadoop clusters, which includes support for Hadoop HDFS, Hadoop MapReduce, Hive, HCatalog, HBase, ZooKeeper, Oozie, Pig, and Sqoop.

Avro™: A data serialization system.

Cassandra™: A scalable multi-master database with no single points of failure.

Chukwa™: A data collection system for managing large distributed systems.

HBase™: A scalable, distributed database that supports structured data storage for large tables (see the client sketch after this list).

Hive™: A data warehouse infrastructure that provides data summarization and ad hoc querying.

Mahout™: A scalable machine learning and data mining library.

Pig™: A high-level data-flow language and execution framework for parallel computation.

Spark™: A fast and general compute engine for Hadoop data. Spark provides a simple and expressive programming model that supports a wide range of applications, including ETL, machine learning, stream processing, and graph computation.

Tez™: A generalized data-flow programming framework, built on Hadoop YARN, which provides a powerful and flexible engine to execute an arbitrary DAG of tasks to process data for both batch and interactive use cases. Tez is being adopted by Hive™, Pig™, and other frameworks in the Hadoop ecosystem, and also by other commercial software (such as ETL tools), to replace Hadoop™ MapReduce as the underlying execution engine.

ZooKeeper™: A high-performance coordination service for distributed applications.
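
Since HBase comes up in the question, here is a minimal sketch of its structured storage model through the HBase Java client. The "users" table, the "info" column family, and the cluster configuration are illustrative assumptions, not part of the official description above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create(); // reads hbase-site.xml
    try (Connection connection = ConnectionFactory.createConnection(conf);
         Table table = connection.getTable(TableName.valueOf("users"))) {

      // Write one cell: row "user1", column family "info", qualifier "name".
      Put put = new Put(Bytes.toBytes("user1"));
      put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                    Bytes.toBytes("Alice"));
      table.put(put);

      // Read it back by row key.
      Result result = table.get(new Get(Bytes.toBytes("user1")));
      byte[] name = result.getValue(Bytes.toBytes("info"),
                                    Bytes.toBytes("name"));
      System.out.println("name = " + Bytes.toString(name));
    }
  }
}
```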

Returning to your question:

Just take a look at the 40+ subsystems in the Hadoop ecosystem. Not everything you mentioned is necessarily Hadoop itself, but most of it is Hadoop-related.

Spark is part of the Hadoop ecosystem, but it does not have to use HDFS or YARN. HDFS data sets can be replaced with RDDs (Resilient Distributed Datasets), and Spark can run in standalone mode without YARN.
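
A minimal sketch of that point, assuming the Spark Java API (the app name and the data are illustrative): the master is a local scheduler rather than YARN, and the RDD is built from an in-memory collection rather than HDFS files.

```java
import java.util.Arrays;
import java.util.List;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;

public class SparkWithoutHadoop {
  public static void main(String[] args) {
    SparkConf conf = new SparkConf()
        .setAppName("spark-without-hadoop")
        .setMaster("local[*]");              // no YARN: local scheduler

    try (JavaSparkContext sc = new JavaSparkContext(conf)) {
      // No HDFS: the RDD is built from an in-memory collection.
      List<Integer> data = Arrays.asList(1, 2, 3, 4, 5);
      JavaRDD<Integer> rdd = sc.parallelize(data);

      int sum = rdd.reduce(Integer::sum);
      System.out.println("sum = " + sum);    // prints: sum = 15
    }
  }
}
```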

Take a look at articles comparing Hadoop and Spark for more detail.

Examples of using Spark over Hadoop:

  • Iterative Algorithms in Machine Learning
  • Interactive data mining and data processing
  • Stream processing
  • Sensor data processing

Since Spark does not have its own storage system, it has to rely on a distributed storage system, and HDFS is one of them.

Take a look at the related SE question:

Can Apache Spark run without Hadoop?

+2

The most common understanding of Hadoop: HDFS and Map/Reduce, as well as related processes and tools.

A related term is the Hadoop ecosystem: Hive/Pig/HBase, ZooKeeper, Oozie. Also vendor-specific tools, such as Impala and Ambari.

+1

Why do we need a big data system?

  • STORE (store a huge amount of data)
  • PROCESS (process data / queries in a timely manner)
  • SCALE (scale easily as data grows)

Google provided a big data solution:

  • Google File System: a solution for distributed storage.
  • MapReduce: a solution for distributed computing.

Google published research papers on these systems. Apache developed an open-source system similar to the one developed by Google; it is known as HADOOP.

  • HDFS (Hadoop Distributed File System), analogous to the Google File System: a file system to manage storage.
  • MapReduce: a framework to process data across multiple servers.

Note: In 2013, Apache released HADOOP 2.0 (MapReduce was split into two components):

  • MapReduce: a framework for defining a data processing task.
  • YARN: a framework for executing a data processing task.

HADOOP ECOSYSTEM

Hadoop was not easy to understand, and its use was limited to hardcore developers. To simplify working with Hadoop, many tools appeared; they are collectively known as the Hadoop ecosystem.

The Hadoop ecosystem contains tools such as:

  • HIVE:
    • provides an SQL interface to Hadoop.
    • a bridge to Hadoop for people who are not familiar with OOP in Java (see the JDBC sketch after this list).
  • HBase:
    • a database management system on top of Hadoop.
    • integrates with your application like a traditional database.
  • PIG:
    • a data manipulation language.
    • converts unstructured data into a structured format.
    • this structured data can be queried via interfaces such as Hive.
  • SPARK:
    • a distributed computing engine used alongside Hadoop.
    • provides an interactive shell for fast processing of data sets.
    • has a bunch of built-in libraries for machine learning, stream processing, graph processing, etc.
  • OOZIE:
    • a tool for scheduling workflows across the Hadoop ecosystem technologies.
  • FLUME / SQOOP:
    • tools for transferring data between other systems and Hadoop.
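
To illustrate the HIVE bullet above, here is a minimal sketch of its SQL interface over JDBC, assuming a HiveServer2 endpoint; the host, port, credentials, and the "words" table are hypothetical. Hive compiles the plain SQL into distributed jobs (MapReduce, Tez, or Spark, depending on the configured engine).

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveQueryExample {
  public static void main(String[] args) throws Exception {
    // Register the driver explicitly (older hive-jdbc versions
    // are not auto-registered via the service loader).
    Class.forName("org.apache.hive.jdbc.HiveDriver");

    // Default HiveServer2 port is 10000; adjust host/database as needed.
    String url = "jdbc:hive2://localhost:10000/default";

    try (Connection con = DriverManager.getConnection(url, "hive", "");
         Statement stmt = con.createStatement();
         ResultSet rs = stmt.executeQuery(
             "SELECT word, COUNT(*) AS cnt FROM words GROUP BY word")) {
      while (rs.next()) {
        System.out.println(rs.getString("word") + "\t" + rs.getLong("cnt"));
      }
    }
  }
}
```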

This completes the very high-level overview of Hadoop.

+1

Source: https://habr.com/ru/post/981491/

