Spark app kills executor

I am running a standalone Spark cluster and submit the application with spark-submit. In the stages section of the Spark UI I found a stage with a very long runtime (> 10 h, when the normal time is ~ 30 seconds). The stage has many failed tasks with the error Resubmitted (resubmitted due to lost executor). In the Aggregated Metrics by Executor table on the stage page there is an executor with the address CANNOT FIND ADDRESS. Spark keeps resubmitting this task endlessly. If I kill this stage (my application re-runs unfinished Spark jobs automatically), everything continues to work fine.

I also found some strange entries in the Spark logs (from around the time the stage started).

Master:

 16/11/19 19:04:32 INFO Master: Application app-20161109161724-0045 requests to kill executors: 0
 16/11/19 19:04:36 INFO Master: Launching executor app-20161109161724-0045/1 on worker worker-20161108150133
 16/11/19 19:05:03 WARN Master: Got status update for unknown executor app-20161109161724-0045/0
 16/11/25 10:05:46 INFO Master: Application app-20161109161724-0045 requests to kill executors: 1
 16/11/25 10:05:48 INFO Master: Launching executor app-20161109161724-0045/2 on worker worker-20161108150133
 16/11/25 10:06:14 WARN Master: Got status update for unknown executor app-20161109161724-0045/1

Worker:

 16/11/25 10:06:05 INFO Worker: Asked to kill executor app-20161109161724-0045/1
 16/11/25 10:06:08 INFO ExecutorRunner: Runner thread for executor app-20161109161724-0045/1 interrupted
 16/11/25 10:06:08 INFO ExecutorRunner: Killing process!
 16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137
 16/11/25 10:06:14 INFO Worker: Asked to launch executor app-20161109161724-0045/2 for app.jar
 16/11/25 10:06:17 INFO SecurityManager: Changing view acls to: spark
 16/11/25 10:06:17 INFO SecurityManager: Changing modify acls to: spark
 16/11/25 10:06:17 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(spark); users with modify permissions: Set(spark)

There are no network connection problems, because the worker, the master (whose logs are shown above), and the driver all run on the same machine.

Spark version 1.6.1

2 answers

The interesting part of the log is probably this:

 16/11/25 10:06:13 INFO Worker: Executor app-20161109161724-0045/1 finished with state KILLED exitStatus 137 

Exit status 137 strongly suggests a resource problem, either memory or CPU cores: 137 is 128 + 9, i.e. the process was killed with SIGKILL, which is typically what the Linux OOM killer sends. Given that you can fix your problem by re-running the stage, it may be that all the cores are already allocated (perhaps you also have a Spark shell running?). This is a common problem with standalone Spark setups (everything on the same host).

In any case, I would try the following:

  • Raise spark.storage.memoryFraction to pre-allocate more memory for Spark and keep the system OOM killer from randomly giving you that 137 on a big stage.
  • Give your application fewer cores, to rule out something else pre-allocating those cores before your stage starts. You can do this via spark.deploy.defaultCores; set it to 3 or even 2 (on an Intel quad-core with 8 virtual cores).
  • Outright allocate more RAM to Spark: spark.executor.memory needs to be increased.
  • Perhaps you are running into a metadata cleanup problem here, which is also not unheard of in local deployments; in this case, adding export SPARK_JAVA_OPTS +="-Dspark.kryoserializer.buffer.mb=10 -Dspark.cleaner.ttl=43200" to the end of your spark-env.sh may do the trick, by cleaning up metadata more often.

One of them should do the trick, in my opinion; a sketch of how most of these settings can be passed from the application follows below.
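
For reference, here is a minimal sketch of setting these values from the application itself (Scala, Spark 1.x). The app name and all values are illustrative assumptions, not recommendations for your workload:

 import org.apache.spark.{SparkConf, SparkContext}

 // Illustrative values only: tune them for the memory and cores on your host.
 // Note: spark.deploy.defaultCores from the list above is a master-side setting;
 // the per-application way to cap cores is spark.cores.max.
 val conf = new SparkConf()
   .setAppName("my-app")                          // hypothetical app name
   .set("spark.storage.memoryFraction", "0.3")    // pre-allocate more storage memory
   .set("spark.cores.max", "2")                   // cap the cores this app may take
   .set("spark.executor.memory", "4g")            // give executors more RAM
   .set("spark.cleaner.ttl", "43200")             // clean up old metadata (seconds)
 val sc = new SparkContext(conf)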


Armin's answer is very good. I just want to point out what worked for me.

The same problem disappeared when I increased this parameter:

spark.default.parallelism from 28 (which was the number of executors I had) to 84 (which is the number of available cores).

NOTE: this is not a general rule for setting this parameter; it is just what worked for me.
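
As a minimal sketch (assuming a Scala application; 84 is just the core count from my setup, and the app name is hypothetical), the parameter can be set on the SparkConf before creating the context:

 import org.apache.spark.{SparkConf, SparkContext}

 // 84 is only an example: in my case it was the total number of available cores.
 val conf = new SparkConf()
   .setAppName("my-app")                       // hypothetical app name
   .set("spark.default.parallelism", "84")     // default partition count for shuffles
 val sc = new SparkContext(conf)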

UPDATE: this approach is also supported by the Spark documentation:

Sometimes you will get an OutOfMemoryError not because your RDDs don't fit in memory, but because the working set of one of your tasks, such as one of the reduce tasks in groupByKey, was too large. Spark's shuffle operations (sortByKey, groupByKey, reduceByKey, join, etc.) build a hash table within each task to perform the grouping, which can often be large. The simplest fix here is to increase the level of parallelism, so that each task's input set is smaller. Spark can efficiently support tasks as short as 200 ms, because it reuses one executor JVM across many tasks and has a low task launching cost, so you can safely increase the level of parallelism to more than the number of cores in your clusters.
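
The level of parallelism can also be raised per shuffle operation; a rough sketch, where the sample data and the value 84 are hypothetical:

 // sc is an existing SparkContext; data and partition count are only examples.
 val pairs = sc.parallelize(Seq(("a", 1), ("b", 2), ("a", 3)))
 val counts = pairs.reduceByKey(_ + _, 84)   // 84 partitions for this shuffle
 val groups = pairs.groupByKey(84)           // same idea for groupByKey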


Source: https://habr.com/ru/post/1012806/
