Flink on YARN with HA enabled causes all RMs to fail when retrying

Question

I am trying to get Flink (1.2.0) to work on our Hadoop cluster (CDH 5.10.0) with HA enabled, but when I test it by killing the active RM, it brings down the entire cluster.

I configured Flink HA in flink-conf.yaml:

high-availability: zookeeper
high-availability.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
high-availability.zookeeper.storageDir: hdfs:///tmp/flink/recovery
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.namespace: /cluster1
yarn.application-attempts: 2
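
To rule out a typo in these keys, the settings can be loaded and inspected programmatically. This is only a minimal sketch, assuming the flat `key: value` layout shown above; `parse_flink_conf` and `SAMPLE` are hypothetical names for illustration, not part of Flink.

```python
# Hypothetical helper (not a Flink API): parse flat "key: value" lines into a dict.
def parse_flink_conf(text):
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and comments.
        if not line or line.startswith("#"):
            continue
        # Split on the first ": " only, so values such as host:port
        # pairs and hdfs:// URIs stay intact.
        key, sep, value = line.partition(": ")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


# The HA settings from the question, copied verbatim.
SAMPLE = """\
high-availability: zookeeper
high-availability.zookeeper.quorum: zookeeper1:2181,zookeeper2:2181,zookeeper3:2181
high-availability.zookeeper.storageDir: hdfs:///tmp/flink/recovery
high-availability.zookeeper.path.root: /flink
high-availability.zookeeper.path.namespace: /cluster1
yarn.application-attempts: 2
"""

conf = parse_flink_conf(SAMPLE)
quorum = conf["high-availability.zookeeper.quorum"].split(",")
attempts = int(conf["yarn.application-attempts"])

for key in sorted(conf):
    print(key, "=", conf[key])
```

Here `quorum` holds the individual ZooKeeper endpoints and `attempts` the configured number of application attempts, which makes it easy to compare them against what the cluster actually reports.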

Then I start a Flink session with: yarn-session.sh -n 2 -nm "Flink HA test"

When I kill the active RM with kill -9, YARN fails over correctly to the backup RM, and I can see the application in the ACCEPTED state for about a minute, but soon the backup RM throws the following exception:

2017-03-08 12:29:36,997 FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_ADDED to the scheduler
java.lang.NullPointerException
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt.transferStateFromPreviousAttempt(SchedulerApplicationAttempt.java:601)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.addApplicationAttempt(FairScheduler.java:698)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1303)
    at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:123)
    at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:702)
    at java.lang.Thread.run(Thread.java:745)