Remote Spark Job Execution in an HDInsight Cluster

I am trying to automatically run a Spark job in a Microsoft Azure HDInsight cluster. I know there are several ways to automate the submission of a Hadoop job (provided by Azure), but so far I have not found a way to remotely submit a Spark job without setting up an RDP connection to the head node.

Is there any way to achieve this?

+6
3 answers

Spark-jobserver provides a RESTful interface for submitting and managing Apache Spark jobs, jars, and job contexts.

https://github.com/spark-jobserver/spark-jobserver

My solution uses both a scheduler and Spark-jobserver to run Spark jobs periodically.
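The jobserver flow is two REST calls: upload the fat jar under an app name, then trigger a run with the main class. A minimal sketch of the endpoints involved (the host, port, app name, and class below are placeholders, not values from this thread):

```python
# Sketch of the spark-jobserver REST endpoints (hypothetical host; 8090 is the
# jobserver's default port). The jar is POSTed to /jars/<appName>, and a job
# is started by POSTing to /jobs with appName and classPath query parameters.
import urllib.parse

JOBSERVER = "http://my-jobserver-host:8090"  # assumption: default jobserver port

def upload_jar_url(app_name):
    # POST the jar bytes to this URL to register the application
    return f"{JOBSERVER}/jars/{app_name}"

def run_job_url(app_name, class_path):
    # POST to this URL to start a job for the registered app
    query = urllib.parse.urlencode({"appName": app_name, "classPath": class_path})
    return f"{JOBSERVER}/jobs?{query}"

print(upload_jar_url("mySpark"))
print(run_job_url("mySpark", "spark.azure.MainClass"))
```

A scheduler (cron, or anything that can issue HTTP requests on a timer) then only needs to POST to the second URL to re-run the job periodically.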

+4

At the time of this writing, there seems to be no official way to achieve this. However, I have managed to run Spark jobs remotely using an Oozie shell workflow. It is nothing more than a workaround, but so far it has worked well for me. These are the steps:

Prerequisites

  • Microsoft PowerShell
  • Azure PowerShell

Process

Define the Oozie workflow *.xml* file:

 <workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">
     <start to="myAction"/>
     <action name="myAction">
         <shell xmlns="uri:oozie:shell-action:0.2">
             <job-tracker>${jobTracker}</job-tracker>
             <name-node>${nameNode}</name-node>
             <configuration>
                 <property>
                     <name>mapred.job.queue.name</name>
                     <value>${queueName}</value>
                 </property>
             </configuration>
             <exec>myScript.cmd</exec>
             <file>wasb://myContainer@myAccount.blob.core.windows.net/myScript.cmd#myScript.cmd</file>
             <file>wasb://myContainer@myAccount.blob.core.windows.net/mySpark.jar#mySpark.jar</file>
         </shell>
         <ok to="end"/>
         <error to="fail"/>
     </action>
     <kill name="fail">
         <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
     </kill>
     <end name="end"/>
 </workflow-app>

Please note that it is not possible to know in advance which HDInsight node will execute the script, so it must be placed, together with the Spark application jar, in wasb storage. The <file> elements then copy both to the local directory of whichever node runs the Oozie action.

Define a custom script

 C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass --master yarn-cluster --deploy-mode cluster --num-executors 3 --executor-memory 2g --executor-cores 4 mySpark.jar 

You need to upload both the .cmd file and the Spark jar to wasb storage (a process not covered in this answer), specifically to the location referenced in the workflow:

wasb://myContainer@myAccount.blob.core.windows.net/

Define the PowerShell script

The PowerShell part of the script is largely taken from the official Oozie on HDInsight tutorial. I do not include the script in this answer because it is almost identical to that tutorial's approach.

I filed a new suggestion on the Azure feedback portal pointing out the need for official support for remote Spark job submission.

+2

Updated 08.17.2012: At present, our Spark cluster offering includes a Livy server, which provides a REST service for submitting Spark jobs. You can also automate Spark job submission through Azure Data Factory.
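Livy accepts batch job submissions as a JSON POST body. A minimal sketch of building that payload (the cluster name and jar path are placeholders; on HDInsight the endpoint is exposed under the cluster's `/livy` gateway path):

```python
# Sketch of a Livy batch submission payload. POSTing this JSON body to the
# cluster's Livy batches endpoint starts a Spark job without any RDP session.
# Cluster name, jar path, and class name below are hypothetical placeholders.
import json

LIVY_URL = "https://myCluster.azurehdinsight.net/livy/batches"

def build_batch_payload(jar_path, main_class, args=None):
    body = {
        "file": jar_path,         # application jar in the cluster's storage
        "className": main_class,  # entry point of the Spark application
    }
    if args:
        body["args"] = args       # optional command-line arguments
    return json.dumps(body)

print(build_batch_payload(
    "wasb://myContainer@myAccount.blob.core.windows.net/mySpark.jar",
    "spark.azure.MainClass",
))
```

A scheduler or Azure Data Factory pipeline can then issue this POST (with cluster credentials) on whatever cadence the job requires.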


Original post: 1) Remote submission of Spark jobs is not currently supported.

2) If you want to avoid setting the master every time (for example, passing --master yarn-client on each submission), you can set the value in the %SPARK_HOME%\conf\spark-defaults.conf file with the following configuration:

spark.master yarn-client

For more information on spark-defaults.conf, see the Apache Spark documentation.

3) Use the cluster customization feature if you want this added to the spark-defaults.conf file automatically at deployment time.

+1

Source: https://habr.com/ru/post/982555/

