Spark jobs never ending and malfunctioning JobProgressListener

We have a Spark application that continuously processes many inbound jobs. Several jobs are processed in parallel, on multiple threads.
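For context, here is a minimal sketch of that kind of setup (all names and paths are hypothetical, not our actual code): one long-running SparkSession, with inbound jobs submitted concurrently from a thread pool. SparkContext is thread-safe, so each worker thread triggers its own Spark jobs on the shared context.

    import java.util.concurrent.Executors
    import scala.concurrent.{ExecutionContext, Future}
    import org.apache.spark.sql.SparkSession

    object ContinuousProcessor {
      def main(args: Array[String]): Unit = {
        // One long-lived session shared by all processing threads.
        val spark = SparkSession.builder()
          .appName("continuous-processor")
          .getOrCreate()

        // Thread pool used to run several inbound jobs in parallel.
        implicit val ec: ExecutionContext =
          ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

        // Hypothetical feed of inbound work items (e.g. file paths to process).
        val inboundPaths: Iterator[String] = Iterator.continually(pollNextPath())

        inboundPaths.foreach { path =>
          Future {
            // Each item runs one or more independent Spark jobs on the shared
            // context; every task emits events on the listener bus.
            spark.read.textFile(path).count()
          }
        }
      }

      // Placeholder for whatever mechanism delivers new work to the application.
      def pollNextPath(): String = ???
    }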

During intense workloads, at some point, we begin to have these warnings:

    16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379
    16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job 64610
    16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage 147405
    16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406
    16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job 64622

From that point on, the performance of the application plummets and most of the stages and tasks never finish. In the Spark UI, I see huge numbers, for example 13,000 pending/active jobs.

I cannot see any other exception happening earlier with additional information. Perhaps this one, although it concerns a different listener:

    16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
    16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 01:00:00 CET 1970

This is a very nasty problem, because there is no obvious crash or clear ERROR message that we could catch in order to restart the application.

UPDATE:

  • The problem occurs with Spark 2.0.2 and Spark 2.1.1
  • Most likely related to SPARK-18838

What bothers me the most is that I would expect this to happen on large setups (a large cluster makes it easier to DDoS the driver with task events), but it does not. Our cluster is rather small; the only particularity is that we tend to mix small and large files in processing, and the small files generate a lot of tasks that finish very quickly.
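To illustrate that last point (the path below is hypothetical): with text inputs, Spark usually creates roughly one partition per small file, so a batch of small files turns into a swarm of very short tasks, and every task start/end is another event the JobProgressListener has to consume.

    // Many small files -> roughly one partition (and one short-lived task) each,
    // so even a quick job pushes a burst of events onto the listener bus.
    val small = sc.textFile("/data/incoming/small-files/*.log")  // hypothetical path
    println(s"partitions: ${small.getNumPartitions}")            // ~ one per small file
    val lines = small.count()                                     // one quick task per partition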

1 answer

I may have found a workaround:

Increasing spark.scheduler.listenerbus.eventqueue.size (to 100000 instead of the default 10000) seems to help, but it may only delay the problem.
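A sketch of how the setting can be applied, assuming it is set before the context is created (it cannot be changed on a running application); the equivalent spark-submit form is --conf spark.scheduler.listenerbus.eventqueue.size=100000:

    import org.apache.spark.SparkConf
    import org.apache.spark.sql.SparkSession

    // Enlarge the listener bus event queue (Spark 2.x default: 10000).
    val conf = new SparkConf()
      .set("spark.scheduler.listenerbus.eventqueue.size", "100000")

    val spark = SparkSession.builder()
      .config(conf)
      .getOrCreate()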

