Spark executors with 32 GB or more of memory encounter a fatal error

I have three slaves in a standalone Spark cluster. Each slave machine has 48 GB of RAM. When I assigned my executors more than 31 GB of RAM (for example, 32 GB or more):

.config("spark.executor.memory", "44g") 

the executors were killed without much information during a join of two large DataFrames. The console output of the driver showed repeated ExecutorLostFailure warnings:

    17/09/21 12:34:18 INFO StandaloneSchedulerBackend: Granted executor ID app-20170921123240-0000/3 on hostPort XXX.XXX.XXX.92:33705 with 6 cores, 44.0 GB RAM
    17/09/21 12:34:18 WARN TaskSetManager: Lost task 14.0 in stage 7.0 (TID 124, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    17/09/21 12:34:18 WARN TaskSetManager: Lost task 5.0 in stage 7.0 (TID 115, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    17/09/21 12:34:18 WARN TaskSetManager: Lost task 17.0 in stage 7.0 (TID 127, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    17/09/21 12:34:18 WARN TaskSetManager: Lost task 8.0 in stage 7.0 (TID 118, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    17/09/21 12:34:18 WARN TaskSetManager: Lost task 2.0 in stage 7.0 (TID 112, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    17/09/21 12:34:18 WARN TaskSetManager: Lost task 11.0 in stage 7.0 (TID 121, XXX.XXX.XXX.92, executor 0): ExecutorLostFailure (executor 0 exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.
    17/09/21 12:34:18 INFO DAGScheduler: Executor lost: 0 (epoch 5)
    17/09/21 12:34:18 INFO BlockManagerMaster: Removal of executor 0 requested
    17/09/21 12:34:18 INFO CoarseGrainedSchedulerBackend$DriverEndpoint: Asked to remove non-existent executor 0
    17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
    17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_2 !
    17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_11 !
    17/09/21 12:34:18 INFO StandaloneAppClient$ClientEndpoint: Executor updated: app-20170921123240-0000/3 is now RUNNING
    17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_5 !
    17/09/21 12:34:18 WARN BlockManagerMasterEndpoint: No more replicas available for rdd_10_8 !
    17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Removing block manager BlockManagerId(0, XXX.XXX.XXX, 34840, None)
    17/09/21 12:34:18 INFO BlockManagerMasterEndpoint: Trying to remove executor 0 from BlockManagerMaster.
    17/09/21 12:34:18 INFO BlockManagerMaster: Removed 0 successfully in removeExecutor

The Spark master log showed that the executors "EXITED" and were then relaunched:

    17/09/21 12:34:18 INFO Master: Removing executor app-20170921123240-0000/0 because it is EXITED
    17/09/21 12:34:18 INFO Master: Launching executor app-20170921123240-0000/3 on worker worker-20170921123014-152.83.247.92-33705

The Spark worker log showed that the executor exited with code 134:

 17/09/21 12:34:18 INFO Worker: Executor app-20170921123240-0000/0 finished with state EXITED message Command exited with code 134 exitStatus 134 

The only real clue appears to be in the application error log, which shows that the JRE detected a fatal error:

    #
    # A fatal error has been detected by the Java Runtime Environment:
    #
    # SIGSEGV (0xb) at pc=0x00007fdec0c92a73, pid=11300, tid=0x00007fd3a6951700
    #
    # JRE version: Java(TM) SE Runtime Environment (8.0_131-b11) (build 1.8.0_131-b11)
    # Java VM: Java HotSpot(TM) 64-Bit Server VM (25.131-b11 mixed mode linux-amd64 )
    # Problematic frame:
    # V [libjvm.so+0x3ffa73] CardTableExtension::scavenge_contents_parallel(ObjectStartArray*, MutableSpace*, HeapWord*, PSPromotionManager*, unsigned int, unsigned int)+0x5e3
    #
    # Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
    #
    # If you would like to submit a bug report, please visit:
    # http://bugreport.java.com/bugreport/crash.jsp
    #
    --------------- THREAD ---------------
    Current thread (0x0000000001c9e800): GCTaskThread [stack: 0x00007fd3a6851000,0x00007fd3a6952000] [id=11308]
    siginfo: si_signo: 11 (SIGSEGV), si_code: 1 (SEGV_MAPERR), si_addr: 0x0000000000000008

When I assign 31 GB of RAM (or less) to each executor, my program works fine. Has anyone run into this problem before?

1 answer

44 GB can actually give you a smaller usable heap than 31 GB, because of the way Java stores object references: above a 32 GB heap the JVM has to switch from compressed 32-bit to full 64-bit object references, so every object takes up more space. More details here: http://java-performance.info/over-32g-heap-java/
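One way to see this on your own cluster is to have each executor JVM report its compressed-oops mode at startup. This is a minimal sketch, assuming a standard HotSpot 8 JVM: both -XX flags are stock HotSpot diagnostics, and the memory value is illustrative.

    import org.apache.spark.sql.SparkSession

    // Sketch: make each executor JVM print whether compressed oops are in use.
    val spark = SparkSession.builder()
      .config("spark.executor.memory", "31g") // at ~31 GB or below, references stay compressed
      .config("spark.executor.extraJavaOptions",
        "-XX:+UnlockDiagnosticVMOptions -XX:+PrintCompressedOopsMode")
      .getOrCreate()

With a 31 GB heap, the executor stdout should contain a "Compressed Oops mode" line; with 44 GB it should not, because compressed oops get disabled.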

My rule of thumb is to either stay below 32 GB or go much higher (say, 50 GB). It is usually more efficient to run multiple JVMs, each with less than 32 GB of heap, e.g. with a layout like the sketch below. With 48 GB of RAM per machine, I would stick to a 31 GB heap.
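In standalone mode, one way to get several smaller executors per worker is to set spark.executor.cores below the worker's core count, so each worker launches more than one executor. This is a sketch, not the answer author's code: the 6-core figure comes from the logs above, and the memory split is illustrative.

    import org.apache.spark.sql.SparkSession

    // Illustrative split for a 48 GB, 6-core worker: two executors per worker,
    // each well under the 32 GB compressed-oops threshold.
    val spark = SparkSession.builder()
      .config("spark.executor.memory", "21g") // 2 x 21 GB leaves headroom on a 48 GB box
      .config("spark.executor.cores", "3")    // 6 worker cores / 3 per executor = 2 executors
      .getOrCreate()

Whether two 21 GB executors beat one 31 GB executor depends on the workload, so it is worth benchmarking both layouts.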

