How to load a Kafka topic into HDFS?

I am using the hortonworks sandbox.
Topic creation:

./kafka-topics.sh --create --zookeeper 10.25.3.207:2181 --replication-factor 1 --partitions 1 --topic lognew 

Tailing the Apache access log into the console producer:

 tail -f /var/log/httpd/access_log |./kafka-console-producer.sh --broker-list 10.25.3.207:6667 --topic lognew 

On another terminal (from the Kafka bin directory), run the consumer:

 ./kafka-console-consumer.sh --zookeeper 10.25.3.207:2181 --topic lognew --from-beginning 

The Apache access logs are now being sent to Kafka's "lognew" topic.

I need to save them to HDFS.
Any ideas or suggestions on how to do this?

Thanks in advance. Deepthy

2 answers

We use Camus.

Camus is a simple MapReduce job developed by LinkedIn to load data from Kafka into HDFS. It is able to incrementally copy data from Kafka to HDFS so that every run of the MapReduce job picks up where the previous run stopped. At LinkedIn, Camus is used to load billions of messages per day from Kafka into HDFS.
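
Camus is packaged as a regular Hadoop job driven by a properties file. Here is a minimal sketch of a launch, reusing the broker address from the question, with a jar name and HDFS paths that are assumptions for your build and cluster (verify the property names against your Camus version):

 # camus.properties (illustrative; the paths below are assumptions)
 kafka.brokers=10.25.3.207:6667
 camus.message.decoder.class=com.linkedin.camus.etl.kafka.coders.StringMessageDecoder
 etl.record.writer.provider.class=com.linkedin.camus.etl.kafka.common.StringRecordWriterProvider
 etl.destination.path=/user/deepthy/camus/topics
 etl.execution.base.path=/user/deepthy/camus/exec
 etl.execution.history.path=/user/deepthy/camus/history

 # run the job (the jar name depends on your build)
 hadoop jar camus-example-0.1.0-SNAPSHOT-shaded.jar com.linkedin.camus.etl.kafka.CamusJob -P camus.properties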

But it looks like it has been superseded by Gobblin.

Gobblin is a universal data ingestion framework for extracting, transforming, and loading large volumes of data from a variety of data sources, e.g. databases, REST APIs, FTP/SFTP servers, filers, etc., onto Hadoop. Gobblin handles the routine tasks required for all data ingestion ETLs, including job/task scheduling, task partitioning, error handling, state management, data quality checking, data publishing, etc. Gobblin ingests data from different sources in the same execution framework and manages the metadata of different sources all in one place. This, combined with other features such as auto-scalability, fault tolerance, data quality assurance, extensibility, and the ability to handle data model evolution, makes Gobblin an easy-to-use, self-serving, and efficient data ingestion framework.
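
For reference, Gobblin drives a Kafka-to-HDFS job with a job configuration (.pull) file. A minimal sketch based on the Kafka-HDFS ingestion quickstart in the Gobblin docs, reusing the broker address from the question (class names and keys may differ across Gobblin versions, so treat this as an assumption to verify):

 # lognew-kafka-hdfs.pull (illustrative)
 job.name=PullLognewFromKafka
 job.group=lognew
 kafka.brokers=10.25.3.207:6667
 topic.whitelist=lognew
 bootstrap.with.offset=earliest
 source.class=gobblin.source.extractor.extract.kafka.KafkaSimpleSource
 extract.namespace=gobblin.extract.kafka
 writer.builder.class=gobblin.writer.SimpleDataWriterBuilder
 writer.file.path.type=tablename
 writer.destination.type=HDFS
 writer.output.format=txt
 data.publisher.type=gobblin.publisher.BaseDataPublisher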


You have a few more options:

  • Use Apache Flume to read messages from Kafka and write them to HDFS. There are several examples of how to configure it, and one Cloudera article covers this topic quite well. They even named the solution Flafka ;) See the config sketch after this list.
  • Use the Kafka HDFS Connector, which is fairly easy to set up. However, it requires the Confluent Platform (which is still open source). A config sketch for it also follows below.

We have tried both options quite successfully.
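
For the Flume option, here is a minimal agent configuration sketch in the spirit of the Flafka article, assuming Flume 1.6's Kafka source, the ZooKeeper address and topic from the question, and a hypothetical HDFS path:

 # flafka.conf (the HDFS path below is an assumption)
 tier1.sources = kafka-source
 tier1.channels = mem-channel
 tier1.sinks = hdfs-sink

 tier1.sources.kafka-source.type = org.apache.flume.source.kafka.KafkaSource
 tier1.sources.kafka-source.zookeeperConnect = 10.25.3.207:2181
 tier1.sources.kafka-source.topic = lognew
 tier1.sources.kafka-source.channels = mem-channel

 tier1.channels.mem-channel.type = memory
 tier1.channels.mem-channel.capacity = 10000

 tier1.sinks.hdfs-sink.type = hdfs
 tier1.sinks.hdfs-sink.hdfs.path = /tmp/kafka/lognew
 tier1.sinks.hdfs-sink.hdfs.fileType = DataStream
 tier1.sinks.hdfs-sink.channel = mem-channel

Start it with flume-ng agent -n tier1 -c conf -f flafka.conf.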

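For the Kafka HDFS Connector, the sink is configured with a small properties file along these lines (connector class and keys follow Confluent's HDFS connector quickstart; the NameNode URL here is an assumption):

 # hdfs-sink.properties (hdfs.url is an assumption)
 name=hdfs-sink-lognew
 connector.class=io.confluent.connect.hdfs.HdfsSinkConnector
 tasks.max=1
 topics=lognew
 hdfs.url=hdfs://10.25.3.207:8020
 flush.size=1000

It is launched with connect-standalone together with a worker properties file.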
