Spark Streaming: Queued Microbatch

I tried to put together a simple example of a windowed function using PySpark.

from pyspark import SparkConf, SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

conf = (SparkConf()
        .setMaster("local[2]")
        .setAppName("PythonStreamingKafka"))

sc = SparkContext(conf=conf)
ssc = StreamingContext(sc, 5)  # 5-second batch interval

# thing_topic and brokers are defined elsewhere
input_stream = KafkaUtils.createDirectStream(ssc, [thing_topic], {"metadata.broker.list": brokers})

def windowMetrics(rdd):
    print("Window RDD size:", rdd.count())

# 10-second window sliding every 5 seconds; both are multiples of the batch interval
messages = input_stream.map(lambda x: x[1]).window(10, 5)
messages.foreachRDD(windowMetrics)

ssc.start()
ssc.awaitTermination()

The data comes from Kafka. After launching the job, as the Spark UI screenshot below shows, only one batch is executed and all the other batches are queued. In fact, no processing happens inside the window, because Kafka provides no data for it to process (Input Size = 0).

[Screenshot: Spark UI showing a single active batch with the remaining batches queued]
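
To verify that the direct stream is actually receiving records, it can help to print the Kafka offset range of every microbatch. This is a minimal sketch, assuming the PySpark direct Kafka API (Spark 1.5+), where the batch RDDs produced by createDirectStream expose offsetRanges(); it must run on the raw stream, before map or window is applied:

def print_offsets(rdd):
    # offsetRanges() exists only on RDDs produced directly by
    # createDirectStream, before any transformation is applied
    for o in rdd.offsetRanges():
        print("topic=%s partition=%d offsets=[%d, %d)"
              % (o.topic, o.partition, o.fromOffset, o.untilOffset))

input_stream.foreachRDD(print_offsets)

If the ranges stay empty while the producer is writing, the problem is on the Kafka side; if they grow while Input Size is 0 only for the windowed stream, the window operation itself is the suspect.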

If the window operation is not used, everything works fine. The code is below.

messages = input_stream.map(lambda x: x[1])
messages.foreachRDD(windowMetrics)
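
For comparison, the per-window count can also be computed with countByWindow instead of calling foreachRDD on a windowed stream. This is only a sketch: countByWindow uses an inverse reduce internally, so checkpointing has to be enabled first, and the checkpoint directory below is a placeholder:

ssc.checkpoint("/tmp/spark-checkpoint")  # placeholder path; required by countByWindow

counts = input_stream.map(lambda x: x[1]).countByWindow(10, 5)
counts.pprint()  # one count per 10-second window, emitted every 5 seconds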

I searched around and found some related threads, but none of them describes exactly the same problem.

Spark Streaming: long queued/active batches

In that question, intermediate batches also end up queued.

https://forums.databricks.com/questions/1276/kafka-direct-api-from-spark-streaming-what-happens.html


Source: https://habr.com/ru/post/1648730/

