I am new to Spark Streaming and I have 5 worker nodes in my cluster. The goal is to consume a Kafka topic and write the records directly to a NoSQL store such as HBase or DynamoDB. I am trying to understand how Spark creates Kafka Consumer instances and distributes them across the workers (Spark 0.9.0 and Kafka 0.8).
If I create a Kafka stream
val topicMap = Map("myTopic" -> 1)
val kafkaDStream = KafkaUtils.createStream(ssc, zookeeper, group, topicMap).map(_._2)
and perform stream operations such as
val valueStream = kafkaDStream.map(
  s => {
    val json = new JsonWrapper
    val js = json.parse(s)
    val a = (js \ "a").toString
    val b = (js \ "b").toString
    val c = (js \ "c").toString
    (a, b, c)
  }
)
valueStream.foreachRDD(
  rdd => {
    rdd.foreach { row =>
      // write row to HBase / DynamoDB here (omitted)
    }
  }
)
ssc.start()
ssc.awaitTermination()
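For reference, the part I have left blank inside foreach is the write to the store. Below is a rough sketch of what I have in mind, assuming the classic HBase client API (HBaseConfiguration / HTable / Put); the table name "myTable", the column family "cf", and using field a as the row key are just placeholders, not decisions I have made:

import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HTable, Put}
import org.apache.hadoop.hbase.util.Bytes

valueStream.foreachRDD(
  rdd => {
    // one connection per partition, opened on whichever worker processes it,
    // since HTable is not serializable and cannot be shipped from the driver
    rdd.foreachPartition { rows =>
      val conf = HBaseConfiguration.create()      // picks up hbase-site.xml on the worker
      val table = new HTable(conf, "myTable")     // placeholder table name
      rows.foreach { case (a, b, c) =>
        val put = new Put(Bytes.toBytes(a))       // placeholder row key
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("b"), Bytes.toBytes(b))
        put.add(Bytes.toBytes("cf"), Bytes.toBytes("c"), Bytes.toBytes(c))
        table.put(put)
      }
      table.close()
    }
  }
)

Whether that is even the right pattern is part of what I am unsure about, since it depends on where the consumers and the resulting partitions actually live.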
Where exactly are the Kafka consumers created? Does the driver program create the Consumer instances and distribute them to the workers, or do the workers create Consumers themselves as needed?
Here's more about JsonWrapper if you need it:
import play.api.libs.json.Json
class JsonWrapper extends Serializable {
  lazy val jsObj = Json
  def parse(s: String) = jsObj.parse(s)
}
Json here is the JSON object from the Play framework.
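For completeness, the messages on the topic are JSON strings from which I pull the three fields; a throwaway illustration of the wrapper (the sample record is made up):

val sample = """{"a": "1", "b": "foo", "c": "bar"}"""   // made-up record
val js = new JsonWrapper().parse(sample)
val a = (js \ "a").toString   // extract field "a", same as in the map above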