At the time of this writing, there seems to be no official way to achieve this. However, I have managed to run Spark jobs remotely using an Oozie shell workflow. It is nothing more than a workaround, but so far it has been useful to me. These are the steps:
Prerequisites
- Microsoft PowerShell
- Azure PowerShell
Process
Define the Oozie workflow *.xml* file:
<workflow-app name="myWorkflow" xmlns="uri:oozie:workflow:0.2">
    <start to="myAction"/>
    <action name="myAction">
        <shell xmlns="uri:oozie:shell-action:0.2">
            <job-tracker>${jobTracker}</job-tracker>
            <name-node>${nameNode}</name-node>
            <configuration>
                <property>
                    <name>mapred.job.queue.name</name>
                    <value>${queueName}</value>
                </property>
            </configuration>
            <exec>myScript.cmd</exec>
            <file>wasb://myContainer@myAccount.blob.core.windows.net/myScript.cmd#myScript.cmd</file>
            <file>wasb://myContainer@myAccount.blob.core.windows.net/mySpark.jar#mySpark.jar</file>
        </shell>
        <ok to="end"/>
        <error to="fail"/>
    </action>
    <kill name="fail">
        <message>Shell action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
    </kill>
    <end name="end"/>
</workflow-app>
Please note that it is not possible to know in advance which HDInsight node the script will be executed on, so both the script and the Spark .jar application must be placed in the wasb repository. The <file> elements in the workflow then copy them into the local working directory of the node on which the Oozie action runs.
Define the custom script (myScript.cmd):
C:\apps\dist\spark-1.2.0\bin\spark-submit --class spark.azure.MainClass --master yarn-cluster --deploy-mode cluster --num-executors 3 --executor-memory 2g --executor-cores 4 mySpark.jar
You need to upload both the .cmd file and the Spark .jar to the wasb repository (a sketch of one way to do this follows below), specifically to the location referenced in the workflow:
wasb://myContainer@myAccount.blob.core.windows.net/
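The upload itself was not part of my original steps, but as a minimal sketch it could look like the following, assuming the classic Azure PowerShell module; the local file paths and the storage key are placeholders, while the account and container names match the workflow above:

# Minimal upload sketch; account/container names match the workflow above,
# local paths and the storage key are placeholders.
$context = New-AzureStorageContext -StorageAccountName "myAccount" -StorageAccountKey "<storageKey>"

# Copy the shell script and the Spark application jar to the container root
Set-AzureStorageBlobContent -File "C:\local\myScript.cmd" -Container "myContainer" -Blob "myScript.cmd" -Context $context
Set-AzureStorageBlobContent -File "C:\local\mySpark.jar" -Container "myContainer" -Blob "mySpark.jar" -Context $context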
Define the PowerShell script
The PowerShell part of the script is taken almost entirely from the official Oozie on HDInsight tutorial. I do not include my script in this answer because it is nearly identical to the one in the tutorial.
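For orientation, here is a rough sketch of what the submission step looks like, assuming the Oozie REST API exposed through the HDInsight gateway (the pattern the tutorial follows); the cluster name, credentials, jobTracker address, and application path are placeholder values, and the workflow .xml is assumed to already be uploaded to the application path:

# Rough sketch of submitting the workflow through the Oozie REST API;
# cluster name, credentials, jobTracker address, and paths are placeholders.
$clusterName = "myCluster"
$creds = Get-Credential  # HDInsight gateway (cluster) credentials

# Oozie job configuration supplying the parameters the workflow references;
# oozie.wf.application.path must contain the uploaded workflow .xml.
$config = @"
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
    <property><name>nameNode</name><value>wasb://myContainer@myAccount.blob.core.windows.net</value></property>
    <property><name>jobTracker</name><value>jobtrackerhost:9010</value></property>
    <property><name>queueName</name><value>default</value></property>
    <property><name>user.name</name><value>admin</value></property>
    <property><name>oozie.wf.application.path</name><value>wasb://myContainer@myAccount.blob.core.windows.net/myWorkflowDir</value></property>
</configuration>
"@

# Submit and start the workflow; Oozie answers with the job id
$response = Invoke-RestMethod -Method Post -Uri "https://$clusterName.azurehdinsight.net/oozie/v2/jobs?action=start" -Credential $creds -Body $config -ContentType "application/xml"
$jobId = $response.id

# Poll the job status
Invoke-RestMethod -Method Get -Uri "https://$clusterName.azurehdinsight.net/oozie/v2/job/${jobId}?show=info" -Credential $creds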
I have also created a suggestion on the Azure feedback portal pointing out the need for official support for remote Spark job submission.