How to submit Apache Spark jobs to Hadoop YARN on Azure HDInsight

I am very pleased that HDInsight has moved to a version of Hadoop 2 that supports Apache Spark through YARN. Apache Spark is a far better-suited parallel programming paradigm than MapReduce for the task I want to accomplish.

I was unable to find any documentation on how to remotely submit an Apache Spark job to my HDInsight cluster. For remotely submitting standard MapReduce jobs, I know there are several REST endpoints such as Templeton and Oozie. But, as far as I could find, launching Spark jobs through Templeton is not possible. It does seem possible to include Spark jobs in an Oozie workflow, but I have read that this is very tedious, and I have also seen reports that job-failure detection does not work correctly in that case.

There is probably a better way to submit Spark jobs. Does anyone know of a way to remotely submit Apache Spark jobs to HDInsight?

Thank you very much in advance!

3 answers

You can install Spark on an HDInsight cluster. You do this by creating a custom cluster and adding a Script Action that installs Spark on the cluster at the time the cluster's VMs are created.

Installing with a Script Action during cluster creation is fairly easy; you can do it in C# or PowerShell by adding a few lines of code to a standard custom-cluster creation script/program.

PowerShell:

    # Add the Script Action to the cluster configuration
    $config = Add-AzureHDInsightScriptAction -Config $config `
        -Name "Install Spark" `
        -ClusterRoleCollection HeadNode `
        -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1

C#:

    // Add the Script Action to install Spark
    clusterInfo.ConfigActions.Add(new ScriptAction(
        "Install Spark",                                    // name of the config action
        new ClusterNodeType[] { ClusterNodeType.HeadNode }, // list of nodes to install Spark on
        new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1"), // location of the install script
        null                                                // the script takes no parameters
    ));

You can then RDP into the headnode and run the Spark shell, or use spark-submit to run jobs, for example as shown below. I'm not sure how to submit a Spark job without RDP-ing into the headnode, but that is a separate question.
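A typical invocation from the headnode might look like the following; the application JAR, main class, and executor count are placeholders of mine, not anything the script action sets up for you:

    spark-submit --class com.example.MySparkApp --master yarn-client --num-executors 3 mysparkapp.jar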


I asked the Azure folks the same question. Here is the solution they gave:

"Two questions to the topic: 1. How can we send work outside the cluster without the" Remote ... "- Tao Li

This feature is not supported right now. One way to work around it is to build your own job-submission web service (a rough sketch follows the list below):

  • Create a Scala web service that uses the Spark APIs to start jobs on the cluster.
  • Host this web service in a VM inside the same VNet as the cluster.
  • Expose the web service endpoint externally through some authentication scheme. You could also go through an intermediate MapReduce job, though that would take longer.
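As a rough illustration of the first bullet, here is a minimal sketch of such a service in Scala, using only the JDK's built-in HTTP server and shelling out to spark-submit rather than calling the Spark APIs directly. The port, the request format, and the assumption that spark-submit is on the PATH are all mine, and a real service would need the authentication layer described above:

    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import java.net.InetSocketAddress
    import scala.sys.process._

    object SparkSubmitService {
      def main(args: Array[String]): Unit = {
        val server = HttpServer.create(new InetSocketAddress(8080), 0)

        server.createContext("/submit", new HttpHandler {
          def handle(exchange: HttpExchange): Unit = {
            // Assumed request format: the body carries "<mainClass> <jarPath>"
            val body = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
            val Array(mainClass, jarPath) = body.trim.split("\\s+", 2)

            // Shell out to spark-submit on this VM; yarn-client was the usual
            // master setting for Spark 1.x jobs on a YARN cluster
            val exitCode = Seq("spark-submit",
              "--class", mainClass,
              "--master", "yarn-client",
              jarPath).!

            val reply = s"spark-submit exited with code $exitCode\n"
            exchange.sendResponseHeaders(if (exitCode == 0) 200 else 500, reply.length)
            exchange.getResponseBody.write(reply.getBytes)
            exchange.getResponseBody.close()
          }
        })

        server.start()
        println("Submission service listening on port 8080")
      }
    }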

You can use Brisk (https://brisk.elastatools.com), which offers Spark on Azure as a provisioned service (with support). It has a free tier, and it lets you access blob storage using wasb://path/to/files paths, just like HDInsight.
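For example, reading a file from blob storage in the Spark shell might look like this; the container, storage account, and path are placeholders:

    // Count the lines in a text file stored in Azure blob storage via wasb://
    val lines = sc.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/path/to/files/sample.txt")
    println(lines.count())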

It doesn't sit on YARN; instead, it is a lightweight, Azure-focused Spark distribution.

Disclaimer: I work on the project!

Best wishes,

Andy

