Add YARN cluster configuration to a Spark app

I am trying to use Spark on YARN from a Scala sbt application instead of using spark-submit directly.

I already have a remote YARN cluster, and I can connect to it in yarn-client mode from SparkR. But when I try to do the same from a Scala application, it does not pick up the environment variables pointing to the YARN configuration and instead falls back to the default ResourceManager address and port.

The sbt application is just a simple object:

    import org.apache.spark.{SparkConf, SparkContext}

    object simpleSparkApp {
      def main(args: Array[String]): Unit = {
        val conf = new SparkConf()
          .setAppName("simpleSparkApp")
          .setMaster("yarn-client")
          .set("SPARK_HOME", "/opt/spark-1.5.1-bin-hadoop2.6")
          .set("HADOOP_HOME", "/opt/hadoop-2.6.0")
          .set("HADOOP_CONF_DIR", "/opt/hadoop-2.6.0/etc/hadoop")
        val sc = new SparkContext(conf)
      }
    }

When I run this application in IntelliJ IDEA, the log says:

    15/11/15 18:46:05 INFO RMProxy: Connecting to ResourceManager at /0.0.0.0:8032
    15/11/15 18:46:06 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
    15/11/15 18:46:07 INFO Client: Retrying connect to server: 0.0.0.0/0.0.0.0:8032. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=10, sleepTime=1 SECONDS)
    ...

It seems the environment variables are not being picked up, because 0.0.0.0 is not the IP of the remote YARN ResourceManager node, yet my spark-env.sh has:

    export JAVA_HOME="/usr/lib/jvm/ibm-java-x86_64-80"
    export HADOOP_HOME="/opt/hadoop-2.6.0"
    export HADOOP_CONF_DIR="$HADOOP_HOME/etc/hadoop"
    export SPARK_MASTER_IP="master"

and my yarn-site.xml has:

    <property>
      <name>yarn.resourcemanager.hostname</name>
      <value>master</value>
    </property>

How can I correctly pass the YARN cluster configuration to this sbt Spark application?

Additional Information:

My system is Ubuntu 14.04, and the SparkR code that can connect to the YARN cluster looks like this:

    Sys.setenv(HADOOP_HOME = "/opt/hadoop-2.6.0")
    Sys.setenv(SPARK_HOME = "/opt/spark-1.4.1-bin-hadoop2.6")
    .libPaths(c(file.path(Sys.getenv("SPARK_HOME"), "R", "lib"), .libPaths()))
    library(SparkR)
    sc <- sparkR.init(master = "yarn-client")
1 answer

These days there is no out-of-the-box way to avoid spark-submit for YARN mode.

spark-submit: to start a job, the spark-submit command runs org.apache.spark.deploy.yarn.Client in a properly configured environment (or, as in your case, an unconfigured one). It is this Client that submits the task to YARN: https://github.com/apache/spark/blob/master/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
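
To see where the 0.0.0.0 comes from, here is a minimal diagnostic sketch (assuming the Hadoop YARN client libraries are on your sbt classpath; CheckYarnConf is just an illustrative name). YarnConfiguration loads yarn-site.xml from the JVM classpath, which the spark-submit script populates from HADOOP_CONF_DIR; a SparkConf.set("HADOOP_CONF_DIR", ...) entry never reaches it.

    import org.apache.hadoop.yarn.conf.YarnConfiguration

    object CheckYarnConf {
      def main(args: Array[String]): Unit = {
        // YarnConfiguration reads yarn-default.xml and yarn-site.xml from the
        // JVM classpath; it knows nothing about SparkConf entries.
        val yarnConf = new YarnConfiguration()
        // Prints the built-in default 0.0.0.0:8032 unless yarn-site.xml is
        // visible on the classpath; spark-submit adds $HADOOP_CONF_DIR there
        // for you, an IDE run configuration does not.
        println(yarnConf.get(YarnConfiguration.RM_ADDRESS))
      }
    }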

What is the solution?

As for SparkR.R, it uses spark-submit internally: https://github.com/apache/spark/blob/master/R/pkg/R/sparkR.R calls launchBackend() from https://github.com/apache/spark/blob/master/R/pkg/R/client.R and passes it all the environment settings that are already in place, plus the arguments.
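
If you want the same behaviour from Scala, one option is a sketch along these lines using the spark-launcher module that ships with Spark 1.4+ ("org.apache.spark" %% "spark-launcher" in build.sbt): it spawns spark-submit as a child process, so the environment map you pass, including HADOOP_CONF_DIR, is in effect when the YARN Client runs. LaunchOnYarn and the assembly jar path below are placeholders for your own build.

    import org.apache.spark.launcher.SparkLauncher
    import scala.collection.JavaConverters._

    object LaunchOnYarn {
      def main(args: Array[String]): Unit = {
        // Environment variables handed to the spawned spark-submit process;
        // this is where HADOOP_CONF_DIR actually takes effect.
        val env = Map(
          "HADOOP_CONF_DIR" -> "/opt/hadoop-2.6.0/etc/hadoop",
          "YARN_CONF_DIR"   -> "/opt/hadoop-2.6.0/etc/hadoop"
        )
        val process = new SparkLauncher(env.asJava)
          .setSparkHome("/opt/spark-1.5.1-bin-hadoop2.6")
          .setAppResource("/path/to/your-app-assembly.jar") // placeholder
          .setMainClass("simpleSparkApp")
          .setMaster("yarn-client")
          .launch()                                          // runs spark-submit
        process.waitFor()
      }
    }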

