Reference for command-line options and environment variables for Spark?

I am looking for a complete reference for command-line parameters, environment variables, and configuration files, especially how they relate to each other and which takes precedence.

Thanks:)

Known Resources

Problem example

The standalone-mode documentation says the following:

The following configuration options can be passed to the master and worker:

...

-d DIR, --work-dir DIR   Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker

and later

SPARK_LOCAL_DIRS   Directory to use for "scratch" space in Spark

SPARK_WORKER_DIR   Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).

As a Spark beginner, I'm a little confused.

  • What is the relationship between SPARK_LOCAL_DIRS , SPARK_WORKER_DIR and -d ?
  • If I point them all at different values, which one takes precedence?
  • Do variables set in $SPARK_HOME/conf/spark-env.sh take precedence over variables defined in the shell/script that launches Spark?
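For concreteness, here is roughly what the three mechanisms look like for SPARK_WORKER_DIR; the paths and master host are placeholders, not values from my setup:

# 1. Command-line option passed when starting the worker (worker only)
$SPARK_HOME/sbin/start-slave.sh 1 spark://master-host:7077 -d /some/work/dir

# 2. Variable set in the per-node config file
echo "SPARK_WORKER_DIR=/some/work/dir" >> $SPARK_HOME/conf/spark-env.sh

# 3. Environment variable exported by the shell/script that launches the worker
export SPARK_WORKER_DIR=/some/work/dir
$SPARK_HOME/sbin/start-slave.sh 1 spark://master-host:7077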

The ideal solution

What I'm looking for is a single reference that

  • gives the precedence of the various ways of specifying variables for Spark, and
  • lists all variables/parameters.

For example, something like this:

 Variable          | Cmd-line   | Default           | Description
 SPARK_MASTER_PORT | -p --port  | 8080              | Port for master to listen on
 SPARK_SLAVE_PORT  | -p --port  | random            | Port for slave to listen on
 SPARK_WORKER_DIR  | -d --dir   | $SPARK_HOME/work  | Used as default for worker data
 SPARK_LOCAL_DIRS  |            | $SPARK_WORKER_DIR | Scratch space for RDDs
 ....              | ....       | ....              | ....
1 answer

So the short answer seems to be: such documentation does not exist. I created a request for it on JIRA, hoping it would be addressed in the future, but it was closed as Won't Fix (February 2016).

Precedence

I did a little test and found the following order of precedence:

  • Command-line options are used first
  • conf/spark-env.sh is used when command-line options are missing
  • Plain environment variables are used last, presumably because spark-env.sh overwrites them

Here you can see the full script. For completeness:

 # This uses /tmp/sparktest/cmdline/
 echo "SPARK_WORKER_DIR=/tmp/sparktest/file/" > $SPARK_HOME/conf/spark-env.sh
 SPARK_WORKER_DIR=/tmp/sparktest/envvar/ $SPARK_HOME/sbin/start-slave.sh 1 spark://$LOCAL_HOSTNAME:7077 -d /tmp/sparktest/cmdline/

 # This uses /tmp/sparktest/file/
 echo "SPARK_WORKER_DIR=/tmp/sparktest/file/" > $SPARK_HOME/conf/spark-env.sh
 SPARK_WORKER_DIR=/tmp/sparktest/envvar/ $SPARK_HOME/sbin/start-slave.sh 1 spark://$LOCAL_HOSTNAME:7077

 # This uses /tmp/sparktest/envvar/
 echo "" > $SPARK_HOME/conf/spark-env.sh
 SPARK_WORKER_DIR=/tmp/sparktest/envvar/ $SPARK_HOME/sbin/start-slave.sh 1 spark://$LOCAL_HOSTNAME:7077
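Not part of the original test, but a rough way to confirm each result, assuming /tmp/sparktest starts out empty and that the worker creates its work directory on startup:

 # List which of the three candidate directories the worker actually created
 ls -d /tmp/sparktest/*/

 # Stop the worker before the next run (sbin/stop-slave.sh exists in Spark 1.4+;
 # otherwise kill the worker JVM) and clean up so the results don't mix
 $SPARK_HOME/sbin/stop-slave.sh
 rm -rf /tmp/sparktest/*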
