We have a Spark application that continuously processes many inbound jobs. Several jobs are processed in parallel, on multiple threads.
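For context, this is roughly the shape of the driver: a minimal sketch with illustrative names (the thread-pool size, app name, and count action are all made up). Each action submitted from a worker thread becomes a separate Spark job with its own stage and task events:

```scala
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

import org.apache.spark.sql.SparkSession

object ConcurrentJobsSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("inbound-jobs") // illustrative name
      .getOrCreate()

    // Fixed pool of worker threads; each submitted task runs one Spark job end to end.
    implicit val ec: ExecutionContext =
      ExecutionContext.fromExecutor(Executors.newFixedThreadPool(8))

    val inboundPaths: Seq[String] = args.toSeq // paths of inbound files

    val results = inboundPaths.map { path =>
      Future {
        // Each action becomes a separate job visible to JobProgressListener.
        spark.read.textFile(path).count()
      }
    }

    Await.result(Future.sequence(results), Duration.Inf)
    spark.stop()
  }
}
```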
During intense workloads, at some point we start to see these warnings:
16/12/14 21:04:03 WARN JobProgressListener: Task end for unknown stage 147379
16/12/14 21:04:03 WARN JobProgressListener: Job completed for unknown job 64610
16/12/14 21:04:04 WARN JobProgressListener: Task start for unknown stage 147405
16/12/14 21:04:04 WARN JobProgressListener: Task end for unknown stage 147406
16/12/14 21:04:04 WARN JobProgressListener: Job completed for unknown job 64622
From that point on, the application's performance degrades badly: most stages and tasks never finish. In the Spark UI I see, for example, 13,000 pending/active jobs.
I cannot find any earlier exception with more information. Perhaps this one, although it concerns a different listener:
16/12/14 21:03:54 ERROR LiveListenerBus: Dropping SparkListenerEvent because no remaining room in event queue. This likely means one of the SparkListeners is too slow and cannot keep up with the rate at which tasks are being started by the scheduler.
16/12/14 21:03:54 WARN LiveListenerBus: Dropped 1 SparkListenerEvents since Thu Jan 01 01:00:00 CET 1970
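For reference, the queue that overflows here has a fixed capacity. If I am not mistaken, in Spark 2.x it is controlled by spark.scheduler.listenerbus.eventqueue.size (default 10000); a sketch of raising it as an experiment, not as a fix:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Assumption: spark.scheduler.listenerbus.eventqueue.size is the Spark 2.x knob
// for the LiveListenerBus queue capacity (default 10000). More headroom only
// delays the drops if a listener is structurally too slow.
val conf = new SparkConf()
  .set("spark.scheduler.listenerbus.eventqueue.size", "100000")

val spark = SparkSession.builder()
  .config(conf)
  .getOrCreate()
```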
This is a very nasty problem, because there is no obvious crash and no clear ERROR message that we could catch in order to restart the application.
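Since there is nothing obvious to trap, one idea (just a sketch, not something we currently run) is to track started-versus-finished jobs with our own SparkListener. It sits on the same bus, so once events start being dropped the counter drifts upward, and that drift itself could serve as the restart signal:

```scala
import java.util.concurrent.atomic.AtomicLong

import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}
import org.apache.spark.sql.SparkSession

// Illustrative health check: if started-but-not-ended jobs keep growing,
// job-end events are probably being dropped by the listener bus.
class ActiveJobWatcher extends SparkListener {
  val active = new AtomicLong(0)
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = active.incrementAndGet()
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = active.decrementAndGet()
}

val spark = SparkSession.builder().getOrCreate()
val watcher = new ActiveJobWatcher
spark.sparkContext.addSparkListener(watcher)

// Elsewhere, a monitoring thread could alert or trigger a restart when
// watcher.active.get() stays above some threshold for too long.
```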
UPDATE:
- The problem occurs with Spark 2.0.2 and Spark 2.1.1
- Most likely related to SPARK-18838
What bothers me the most is that I would expect this to happen on large configurations (a large cluster makes it easier to DDoS the driver with task results), but that is not our case. Our cluster is rather small; its only peculiarity is that we tend to mix small and large files in the same workload, and the small files generate a lot of tasks that finish very quickly.
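To illustrate that last point: each small file typically becomes its own input partition, so one read over a batch of small files fans out into thousands of very short tasks, each emitting start/end events to the listener bus. A hedged sketch of reducing that fan-out by coalescing (the path and partition count are made up):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()

// Reading a directory of many small files yields roughly one partition per file;
// coalescing collapses them so far fewer (longer-lived) tasks hit the listener bus.
val lines = spark.read.textFile("/data/inbound/small-files/*") // illustrative path
  .coalesce(16)                                                // illustrative partition count

println(lines.count())
```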