Apache Toree: connecting to a remote Spark cluster

Is there a way to connect Apache Toree to a remote Spark cluster? The usual install command I see is:

jupyter toree install --spark_home=/usr/local/bin/apache-spark/

How can I use Spark on a remote server without having to install it locally?

2 answers

There is indeed a way to get Toree to connect to a remote Spark cluster.

The easiest way I've found is to clone an existing Toree Scala/Python kernel and create a new Toree Scala/Python Remote kernel from it. That way you can choose to run either locally or remotely.

Steps:

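  • Make a copy of the existing kernel. On my Toree install the kernels live in /usr/local/share/jupyter/kernels/, so I ran:
    cp -pr /usr/local/share/jupyter/kernels/apache_toree_scala/ /usr/local/share/jupyter/kernels/apache_toree_scala_remote/

  • Edit the new kernel.json in /usr/local/share/jupyter/kernels/apache_toree_scala_remote/ and add the required Spark options to the __TOREE_SPARK_OPTS__ variable. Strictly speaking only --master <path> is needed, but you can also add --num-executors, --executor-memory, and so on.

  • Restart Jupyter.

My kernel.json looks like this: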

{
  "display_name": "Toree - Scala Remote",
  "argv": [
    "/usr/local/share/jupyter/kernels/apache_toree_scala_remote/bin/run.sh",
    "--profile",
    "{connection_file}"
  ],
  "language": "scala",
  "env": {
    "PYTHONPATH": "/opt/spark/python:/opt/spark/python/lib/py4j-0.9-src.zip",
    "SPARK_HOME": "/opt/spark",
    "DEFAULT_INTERPRETER": "Scala",
    "PYTHON_EXEC": "python",
    "__TOREE_OPTS__": "",
    "__TOREE_SPARK_OPTS__": "--master spark://192.168.0.255:7077 --deploy-mode client --num-executors 4 --executor-memory 4g --executor-cores 8 --packages com.databricks:spark-csv_2.10:1.4.0"
  }
}
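
The copy-and-edit steps above can be sketched as a small shell helper. This is only a sketch: make_remote_kernel is a hypothetical name, it assumes "__TOREE_SPARK_OPTS__" and its value sit on one line of kernel.json (as in the example above), and it leaves display_name and the run.sh path in argv for you to edit by hand.

```shell
# Hypothetical helper (not part of Toree): clone a kernel directory and
# set __TOREE_SPARK_OPTS__ in the copy's kernel.json.
# Usage: make_remote_kernel <src_kernel_dir> <dst_kernel_dir> <spark_opts>
make_remote_kernel() {
    src=$1
    dst=$2
    opts=$3

    # Refuse to overwrite an existing kernel directory.
    if [ -e "$dst" ]; then
        echo "destination already exists: $dst" >&2
        return 1
    fi

    cp -pr "$src" "$dst" || return 1

    # Replace the value of __TOREE_SPARK_OPTS__. '|' is the sed delimiter
    # because master URLs contain '/'. Assumption: opts contains no '|', '&'
    # or backslash, and the key and its value are on one line.
    sed "s|\"__TOREE_SPARK_OPTS__\": *\"[^\"]*\"|\"__TOREE_SPARK_OPTS__\": \"$opts\"|" \
        "$src/kernel.json" > "$dst/kernel.json"
}

# Example call (paths and master URL taken from the answer above):
# make_remote_kernel /usr/local/share/jupyter/kernels/apache_toree_scala \
#     /usr/local/share/jupyter/kernels/apache_toree_scala_remote \
#     "--master spark://192.168.0.255:7077 --deploy-mode client"
```

Editing JSON with sed is fragile; it is used here only to keep the sketch free of extra tools. Restart Jupyter afterwards so the new kernel is picked up.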

This can also be made to work with Cloudera 5.9.2, with some extra effort.

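Building the CDH version on OS/X (using Bash as the shell):

  • Clone https://github.com/Myllyenko/incubator-toree

  • Install Docker

  • Set up signing for the build. It has been a while since this was set up, so the details are TBD

  • Create a new git branch and edit .travis.yml, README.md and build.sbt, changing 5.10.x to 5.9.2

  • Start Docker, cd into the directory where make release is run, run make release, wait, and sign the 3 builds

  • Copy ./dist/toree-pip/toree-0.2.0-spark-1.6.0-cdh5.9.2.tar.gz to the spark-shell machine that can reach your YARN-managed Spark cluster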

  • , .. ,

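Spark machine setup:

Note: some steps may need to be run as root.

  • Install pip/anaconda (see their own docs)

  • Install Jupyter: sudo pip install jupyter

  • Install Toree: sudo pip install toree-0.2.0-spark-1.6.0-cdh5.9.2.tar.gz (the tarball built above), or use the apache-toree distribution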

Configure Toree to run with Jupyter (example): add the following to ~/.bash_profile

echo $PATH
PATH=$PATH:$HOME/bin
export PATH
echo $PATH

export CDH_SPARK_HOME=/opt/cloudera/parcels/CDH/lib/spark
export SPARK_HOME=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/spark
export PYTHONPATH=$PYTHONPATH:$SPARK_HOME/python:$SPARK_HOME/python/lib
export SPARK_CONF_DIR=/etc/spark/conf
export HADOOP_HOME=/opt/cloudera/parcels/CDH/lib/hadoop
PATH=$PATH:$SPARK_HOME/bin
export PATH
echo $PATH

export SPARK_PKGS=$(cat << END | xargs echo | sed 's/ /,/g'
com.databricks:spark-csv_2.10:1.5.0
END
)

export SPARK_JARS=$(cat << END | xargs echo | sed 's/ /,/g'
/home/mymachine/extras/someapp.jar
/home/mymachine/extras/jsoup-1.10.3.jar
END
)

export TOREE_JAR="/usr/local/share/jupyter/kernels/apache_toree_scala/lib/toree-assembly-0.2.0-spark-1.6.0-cdh5.9.2-incubating.jar"

export SPARK_OPTS="--master yarn-client \
  --conf spark.yarn.config.gatewayPath=/opt/cloudera/parcels \
  --conf spark.scheduler.mode=FAIR \
  --conf spark.executor.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop \
  --conf spark.driver.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop \
  --conf spark.yarn.am.extraLibraryPath=/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop/lib/native:/opt/cloudera/parcels/CDH-5.9.2-1.cdh5.9.2.p0.3/lib/hadoop \
  --conf spark.yarn.historyServer.address=http://yourCDHcluster.net:18088 \
  --conf spark.default.parallelism=20 \
  --conf spark.driver.maxResultSize=1g \
  --conf spark.driver.memory=1g \
  --conf spark.executor.cores=4 \
  --conf spark.executor.instances=5 \
  --conf spark.executor.memory=1g \
  --packages $SPARK_PKGS \
  --jars $SPARK_JARS"

function jti() {
    jupyter toree install \
    --replace \
    --user \
    --kernel_name="CDH 5.9.2 Toree" \
    --debug \
    --spark_home=${SPARK_HOME} \
    --spark_opts="$SPARK_OPTS" \
    --log-level=0
}
function jn() {
    jupyter notebook --ip=127.0.0.1 --port=8888 --debug --log-level=0
}
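
The SPARK_PKGS and SPARK_JARS assignments above turn a one-per-line heredoc into the comma-separated list that --packages and --jars expect. A minimal equivalent is sketched below, assuming a hypothetical helper name join_csv and using POSIX paste in place of xargs and sed:

```shell
# Hypothetical helper: join the non-empty lines on stdin with commas.
join_csv() {
    # grep -v drops blank lines; paste -s joins all lines, -d, sets the separator.
    grep -v '^[[:space:]]*$' | paste -sd, -
}

# Same jar list as in the profile above, built with the helper.
SPARK_JARS=$(join_csv << END
/home/mymachine/extras/someapp.jar
/home/mymachine/extras/jsoup-1.10.3.jar
END
)
echo "$SPARK_JARS"
```

Unlike the xargs and sed version, this does not turn spaces inside a path into commas, which matters only if your jar paths contain spaces.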

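Running Toree (port 8888 in this example):

  1. Pick the machine that has Toree and spark-shell available

  2. ssh to it with port forwarding: ssh -L 8888:localhost:8888 toreebox.cdhcluster.net (assuming 8888 is the port set in the bash file)

  3. As a regular user (not root), run jti to install Toree into Jupyter. (Sidebar: @jamcom, this step generates the kernel file automatically. Because of --user it ends up under your home directory, owned by your user rather than root.)

  4. As the same user, run jn to start Jupyter Notebook. Wait a few seconds for it to print a URL, then paste that URL into your browser.

  5. In Jupyter, create a new notebook with the CDH 5.9.2 Toree kernel (or whichever version you installed). Once Toree is up, run something like sc.getConf.getAll.sortWith(_._1 < _._1).foreach(println) to get the Spark context created. Be patient: the job has to be submitted to the cluster, and it can take a while if the cluster is busy.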
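
The tunnel in step 2 has to use the same port that jn passes to jupyter notebook. Since jn binds the notebook to 127.0.0.1 on the remote machine, it is reachable from your own machine only through the forwarded port. A small sketch that builds the matching command, where tunnel_cmd is a hypothetical helper name:

```shell
# Hypothetical helper: print the ssh port-forwarding command for a notebook port.
# Usage: tunnel_cmd <remote_host> [port]
tunnel_cmd() {
    host=$1
    port=${2:-8888}   # default matches --port=8888 in jn
    # -L local_port:localhost:remote_port forwards the local port to the
    # notebook listening on 127.0.0.1 on the remote machine.
    echo "ssh -L ${port}:localhost:${port} ${host}"
}

tunnel_cmd toreebox.cdhcluster.net
```

If you change --port in jn, pass the same number as the second argument.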

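Tips and tricks:

The first run may hit a problem that later runs do not. (There may be an issue about this on github.)

Sometimes you have to kill the old "Apache Toree" application on YARN before a new Toree will start.

Sometimes an orphaned JVM is left behind. If you get memory errors when starting Jupyter Notebook/Toree, or it disconnects unexpectedly, check the process list with top and kill the extra JVM (be careful to identify the right process).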
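
To find the stale application mentioned above, one option is to filter the output of yarn application -list by name and pass the resulting ID to yarn application -kill. The sketch below only does the filtering. toree_app_ids is a hypothetical name, and it assumes the listing is tab-separated with the application ID in the first column and the name in the second, which may differ between Hadoop versions.

```shell
# Hypothetical helper: read `yarn application -list` output on stdin and
# print the IDs of applications whose name contains "Apache Toree".
toree_app_ids() {
    awk -F '\t' '
        $1 ~ /application_/ && $2 ~ /Apache Toree/ {
            gsub(/[ \t]+/, "", $1)   # strip padding around the ID
            print $1
        }'
}

# Intended use on the cluster (not run here):
# yarn application -list | toree_app_ids
# yarn application -kill <application id printed above>
```

Check the printed IDs before killing anything, since other users may run applications with the same name.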


Source: https://habr.com/ru/post/1670112/

