Communication between Hadoop and Databases

OK... I have tried searching the Internet and this site for the answer to this question, which seems like a very simple one. I am a complete noob at big data processing.

I want to know the relationship between HDFS and databases. Is it always necessary that, to use HDFS, the data be in some kind of NoSQL format? Is there a specific database that always comes attached when using HDFS? I know that Cloudera offers Hadoop solutions, and they use HBase.

Can I use a relational database as the native database for Hadoop?

3 answers
I want to know the relationship between HDFS and databases. 

As such, there is no relationship between them. If you still want to find some kind of similarity, the only thing the two have in common is that both store data. But that is true of any combination of an FS and a DB; MySQL and ext3, for example. You may say that you store data in MySQL, but ultimately your data sits on top of your FS. Typically, people use NoSQL databases such as HBase on top of their Hadoop cluster to exploit the parallelism and distributed behavior provided by HDFS.

 Is it always necessary that, to use HDFS, the data be in some NoSQL format? 

Actually, there is nothing like a "NoSQL format". You can use HDFS for any kind of data: text, binaries, XML, and so on.
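To make that concrete, here is a minimal sketch of writing and reading arbitrary bytes through the standard org.apache.hadoop.fs API; the NameNode URI, path, and payload are placeholders, not anything prescribed by HDFS itself:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.net.URI;

public class HdfsAnyData {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // HDFS does not care what the bytes mean: text, XML, images, ...
        // "hdfs://namenode:8020" is a placeholder cluster address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        Path path = new Path("/data/example.bin");
        try (FSDataOutputStream out = fs.create(path)) {
            out.write("any bytes at all".getBytes("UTF-8"));
        }
        try (FSDataInputStream in = fs.open(path)) {
            byte[] buf = new byte[16];
            in.readFully(buf);
            System.out.println(new String(buf, "UTF-8"));
        }
    }
}
```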

 Is there a specific database that always comes attached when using HDFS? 

No. The only thing tied to HDFS is the MapReduce framework. That said, you can certainly make a database work on top of HDFS. Users often run NoSQL databases over HDFS; there are several options, such as Cassandra and HBase. It is entirely up to you to decide which one to use.

 Can I use a relational database as the native database for Hadoop? 

There is no OOTB feature that allows this. Moreover, it makes little sense to use an RDBMS with Hadoop. Hadoop was designed for cases where an RDBMS is not a suitable option, for example processing petabytes of data or handling unstructured data. That said, you should not think of Hadoop as a replacement for an RDBMS; the two have completely different goals.

EDIT:

Usually people use NoSQL DBs (e.g. HBase, Cassandra) with Hadoop. Using these databases with Hadoop is just a matter of configuration; you do not need any connector program for this. Besides the point made by @Doctor Dan, there are a few other reasons for choosing a NoSQL DB instead of a SQL DB. One is size: these NoSQL databases offer excellent horizontal scalability, which makes it easy to store petabytes of data. You can scale traditional systems too, but only vertically. Another reason is data complexity: the places where these databases are used mostly deal with unstructured data, which is not easy to handle with traditional systems, for example sensor data, log data, and so on.
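As an illustration of the "just configuration" point, here is a minimal sketch of the HBase Java client (HBase 1.x+ API): the only wiring to the cluster is configuration. It assumes a table sensor_data with a column family d already exists, and the ZooKeeper host is a placeholder:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseOnHdfs {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        // Pointing the client at the cluster is purely configuration.
        conf.set("hbase.zookeeper.quorum", "zk-host");

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("sensor_data"))) {
            Put put = new Put(Bytes.toBytes("row-1"));
            put.addColumn(Bytes.toBytes("d"), Bytes.toBytes("temp"), Bytes.toBytes("21.5"));
            table.put(put);  // the cells end up in HFiles stored on HDFS

            Result r = table.get(new Get(Bytes.toBytes("row-1")));
            System.out.println(Bytes.toString(
                r.getValue(Bytes.toBytes("d"), Bytes.toBytes("temp"))));
        }
    }
}
```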

Basically, I do not understand why SQOOP exists. Why can't we use SQL data directly on Hadoop?

Although Hadoop is very good at handling your big data needs, it is not the solution to all your needs; in particular, it is not suitable for real-time requirements. Suppose you are an online transaction company with a very large data set. You find that you can easily process this data with Hadoop. But the problem is that you cannot serve your customers' real-time needs with Hadoop. This is where SQOOP comes into the picture. It is an import/export tool that lets you move data between a SQL database and Hadoop. You can move your big data into your Hadoop cluster, process it there, and then push the results back into your SQL database with SQOOP to serve your customers in real time.
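Roughly, the round trip looks like the sketch below. Sqoop is normally driven from the command line ("sqoop import ...", "sqoop export ..."); this sketch passes the same arguments to Sqoop 1.x's embedded entry point Sqoop.runTool, assuming Sqoop 1.x is on the classpath. The JDBC URL, credentials, tables, and paths are all placeholders:

```java
import org.apache.sqoop.Sqoop;

public class SqoopRoundTrip {
    public static void main(String[] args) {
        // 1. Pull the big transactional table out of MySQL into HDFS.
        int rc1 = Sqoop.runTool(new String[] {
            "import",
            "--connect", "jdbc:mysql://db-host/shop",
            "--username", "etl", "--password", "secret",
            "--table", "transactions",
            "--target-dir", "/staging/transactions"
        });

        // 2. ... run MapReduce jobs over /staging/transactions here ...

        // 3. Push the aggregated results back into MySQL for real-time serving.
        int rc2 = Sqoop.runTool(new String[] {
            "export",
            "--connect", "jdbc:mysql://db-host/shop",
            "--username", "etl", "--password", "secret",
            "--table", "daily_summary",
            "--export-dir", "/results/daily_summary"
        });
        System.out.println("import rc=" + rc1 + ", export rc=" + rc2);
    }
}
```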

HTH


What you really want to achieve is not clear from your question.

There is only an indirect relationship between HDFS and a database. HDFS is a file system, not a database. Hadoop is a combination of a parallel processing framework (MapReduce) and the HDFS file system. The parallel processing framework grabs chunks of data from the HDFS file system using something called an InputFormat. Some databases, such as Oracle NoSQL Database (ONDB), Cassandra, and Riak, can expose an InputFormat over their own data, so they can participate as a source for MapReduce processing just like data in HDFS.
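To make the InputFormat hand-off concrete, here is a minimal MapReduce driver sketch reading through TextInputFormat; the paths are placeholders, and the mapper only counts lines. Plugging in a database-backed InputFormat instead of TextInputFormat is exactly how the databases mentioned above participate:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class InputFormatDemo {
    // Counts lines; the point is where the records come from, not the logic.
    public static class LineMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context ctx)
                throws IOException, InterruptedException {
            ctx.write(new Text("lines"), new IntWritable(1));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "inputformat-demo");
        job.setJarByClass(InputFormatDemo.class);
        job.setMapperClass(LineMapper.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // This line is the hand-off: the framework asks the InputFormat
        // for splits and record readers.
        job.setInputFormatClass(TextInputFormat.class);
        FileInputFormat.addInputPath(job, new Path("/data/in"));
        FileOutputFormat.setOutputPath(job, new Path("/data/out"));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```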

So what do you want to do?

Hadoop and HDFS are generally useful when you have a large amount of data that has not yet been aggregated and/or structured into the model required for higher-level processing. Sometimes Hadoop is used for higher-level processing that is normally done in some other processing/storage technology built around a decent model. Think of Google Instant: building the search index used to run on MapReduce; then they developed a model and now use a better approach. Google Instant could not have been run on MapReduce itself.


The advantage of Hadoop is its ability to store data with replication, so you cannot simply put, say, SQL Server underneath Hadoop, and doing so would not make much sense anyway. There are frameworks such as HBase, Hive, and Pig (and others) that can be set up to work with Hadoop, and they look and feel like regular SQL languages. Check out the Hortonworks Sandbox if you want something to play with; as they say, from zero to big data in 15 minutes.
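As a taste of that SQL look and feel, here is a rough sketch of querying Hive through HiveServer2's standard JDBC driver; the host, table, and query are placeholders, and hive-jdbc is assumed to be on the classpath:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveSqlFeel {
    public static void main(String[] args) throws Exception {
        // HiveServer2's JDBC driver; the host and port are placeholders.
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://hive-host:10000/default", "user", "");
             Statement stmt = conn.createStatement();
             // Plain SQL-style query over data that actually lives in HDFS.
             ResultSet rs = stmt.executeQuery(
                 "SELECT level, COUNT(*) FROM logs GROUP BY level")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
            }
        }
    }
}
```

Hope this helps.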

