Reference for command-line options and environment variables for Spark?

I am looking for a complete reference for command-line parameters, environment variables, and configuration files, especially how they relate to each other and which takes precedence.

Thanks:)

Known Resources

Problem example

The standalone-mode documentation says the following:

The following configuration options can be passed to the master and worker:

...

-d DIR, --work-dir DIR   Directory to use for scratch space and job output logs (default: SPARK_HOME/work); only on worker

and later

SPARK_LOCAL_DIRS   Directory to use for "scratch" space in Spark

SPARK_WORKER_DIR   Directory to run applications in, which will include both logs and scratch space (default: SPARK_HOME/work).

As a Spark beginner, I'm a little confused.

  • What is the relationship between SPARK_LOCAL_DIRS , SPARK_WORKER_DIR and -d ?
  • If I point them all at different values, which one takes precedence?
  • Do variables set in $SPARK_HOME/conf/spark-env.sh take precedence over variables defined in the shell/script that launches Spark?
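For concreteness, here is roughly what the three mechanisms look like for SPARK_WORKER_DIR; the paths and master host are placeholders, not values from my setup:

# 1. Command-line option passed when starting the worker (worker only)
$SPARK_HOME/sbin/start-slave.sh 1 spark://master-host:7077 -d /some/work/dir

# 2. Variable set in the per-node config file
echo "SPARK_WORKER_DIR=/some/work/dir" >> $SPARK_HOME/conf/spark-env.sh

# 3. Environment variable exported by the shell/script that launches the worker
export SPARK_WORKER_DIR=/some/work/dir
$SPARK_HOME/sbin/start-slave.sh 1 spark://master-host:7077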

The ideal solution

What I'm looking for is a single reference that

  • gives the precedence of the various ways of specifying variables for Spark, and
  • lists all variables/parameters.

For example, something like this:

 Variable          | Cmd-line   | Default           | Description
 SPARK_MASTER_PORT | -p --port  | 8080              | Port for master to listen on
 SPARK_SLAVE_PORT  | -p --port  | random            | Port for slave to listen on
 SPARK_WORKER_DIR  | -d --dir   | $SPARK_HOME/work  | Used as default for worker data
 SPARK_LOCAL_DIRS  |            | $SPARK_WORKER_DIR | Scratch space for RDDs
 ....              | ....       | ....              | ....
1 answer

So the short answer seems to be: such documentation does not exist. I created a request for it on JIRA, hoping it would be addressed in the future, but it was closed as Won't Fix (February 2016).

Precedence

I did a little test and found the following order of precedence:

  • Command-line options are used first
  • conf/spark-env.sh is used when command-line options are missing
  • Plain environment variables are used last, presumably because spark-env.sh overwrites them

Here you can see the full script. For completeness:

 # This uses /tmp/sparktest/cmdline/
 echo "SPARK_WORKER_DIR=/tmp/sparktest/file/" > $SPARK_HOME/conf/spark-env.sh
 SPARK_WORKER_DIR=/tmp/sparktest/envvar/ $SPARK_HOME/sbin/start-slave.sh 1 spark://$LOCAL_HOSTNAME:7077 -d /tmp/sparktest/cmdline/

 # This uses /tmp/sparktest/file/
 echo "SPARK_WORKER_DIR=/tmp/sparktest/file/" > $SPARK_HOME/conf/spark-env.sh
 SPARK_WORKER_DIR=/tmp/sparktest/envvar/ $SPARK_HOME/sbin/start-slave.sh 1 spark://$LOCAL_HOSTNAME:7077

 # This uses /tmp/sparktest/envvar/
 echo "" > $SPARK_HOME/conf/spark-env.sh
 SPARK_WORKER_DIR=/tmp/sparktest/envvar/ $SPARK_HOME/sbin/start-slave.sh 1 spark://$LOCAL_HOSTNAME:7077
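Not part of the original test, but a rough way to confirm each result, assuming /tmp/sparktest starts out empty and that the worker creates its work directory on startup:

 # List which of the three candidate directories the worker actually created
 ls -d /tmp/sparktest/*/

 # Stop the worker before the next run (sbin/stop-slave.sh exists in Spark 1.4+;
 # otherwise kill the worker JVM) and clean up so the results don't mix
 $SPARK_HOME/sbin/stop-slave.sh
 rm -rf /tmp/sparktest/*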
