Kafka Consumers in Spark Streaming - Parallel Consumption on Worker Nodes

I am new to Spark Streaming and I have 5 worker nodes in my cluster. The goal is to consume a Kafka topic and write the records directly to a NoSQL database such as HBase or DynamoDB. I am trying to understand how Spark creates Kafka Consumer instances and distributes them across the workers (Spark 0.9.0 and Kafka 0.8).

If I create a Kafka stream

    import org.apache.spark.streaming.kafka.KafkaUtils // assuming Spark 0.9's spark-streaming-kafka module

    val topicMap = Map("myTopic" -> 1) // topic -> number of consumer threads
    val kafkaDStream = KafkaUtils.createStream(ssc, zookeeper, group, topicMap).map(_._2)

and perform stream operations such as

    val valueStream = kafkaDStream.map { s =>
        val json = new JsonWrapper // note: constructed for every record
        val js = json.parse(s)
        val a = (js \ "a").toString
        val b = (js \ "b").toString
        val c = (js \ "c").toString
        (a, b, c)
    }
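
As a side note, since that map builds a new JsonWrapper for every record, I assume the same thing could be done once per partition with mapPartitions instead (valueStreamAlt is just my name for the variant):

    // alternative sketch: one parser per partition instead of one per record
    val valueStreamAlt = kafkaDStream.mapPartitions { iter =>
        val json = new JsonWrapper
        iter.map { s =>
            val js = json.parse(s)
            ((js \ "a").toString, (js \ "b").toString, (js \ "c").toString)
        }
    }

Either way, I then write the tuples out: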

    valueStream.foreachRDD { rdd =>
        rdd.foreach { row =>
            // put (a, b, c) into the DB (HBase or DynamoDB)
        }
    }
    ssc.start()
    ssc.awaitTermination()
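
Since the foreach body runs on the workers, I assume any database connection has to be created there as well, once per partition rather than per record; DBClient below is a hypothetical stand-in for whatever HBase/DynamoDB client I end up using:

    valueStream.foreachRDD { rdd =>
        rdd.foreachPartition { partition =>
            val client = new DBClient()      // hypothetical client, created on the worker
            partition.foreach { case (a, b, c) =>
                client.put(a, b, c)          // hypothetical put call
            }
            client.close()
        }
    }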

Where exactly are the Kafka consumers created? Does the driver program create the Consumer instances and distribute them to the workers, or do the workers create Consumers as needed?
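
From the docs I get the impression that each createStream call creates exactly one receiver (and so one consumer) running on a single executor, and that the value in topicMap only controls the number of consumer threads inside that receiver. If that's right, consuming in parallel across my 5 worker nodes would require several streams unioned together, roughly:

    // my understanding: one receiver per createStream call, so several
    // streams are needed to spread consumption across the workers
    val numReceivers = 5 // assumption: one receiver per worker node
    val kafkaStreams = (1 to numReceivers).map { _ =>
        KafkaUtils.createStream(ssc, zookeeper, group, topicMap).map(_._2)
    }
    val unionedStream = kafkaStreams.reduce(_ union _)

Is that the intended pattern, or does Spark parallelize a single stream's consumption on its own?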

Here's more about JsonWrapper if you need it:

    import play.api.libs.json.Json

    class JsonWrapper extends Serializable {
        lazy val jsObj = Json // lazy so Play's Json object is resolved on first use
        def parse(s: String) = jsObj.parse(s)
    }

(Json comes from the Play framework's JSON library.)
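
For reference, a quick local check of the wrapper (the sample input is made up):

    val w = new JsonWrapper
    val js = w.parse("""{"a": 1, "b": 2, "c": 3}""")
    println((js \ "a").toString) // prints 1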



