How to submit Apache Spark jobs to Hadoop YARN on Azure HDInsight

I am very pleased that HDInsight has moved to a version of Hadoop 2 that supports Apache Spark through YARN. Apache Spark is a far better-suited parallel programming paradigm than MapReduce for the task I want to accomplish.

I was unable to find any documentation on how to remotely submit an Apache Spark job to my HDInsight cluster. For remotely submitting standard MapReduce jobs, I know there are several REST endpoints such as Templeton and Oozie. But, as far as I could find, launching Spark jobs through Templeton is not possible. It does seem possible to include Spark jobs in an Oozie workflow, but I have read that this is very tedious, and I have also seen reports that job-failure detection does not work correctly in that case.

There is probably a better way to submit Spark jobs. Does anyone know of a way to remotely submit Apache Spark jobs to HDInsight?

Thank you very much in advance!

3 answers

You can install Spark on an HDInsight cluster. You do this by creating a custom cluster and adding a Script Action that installs Spark on the cluster at the time the cluster's VMs are created.

Installing with a Script Action during cluster creation is fairly easy; you can do it in C# or PowerShell by adding a few lines of code to a standard custom-cluster creation script/program.

PowerShell:

    # Add the Script Action to the cluster configuration
    $config = Add-AzureHDInsightScriptAction -Config $config `
        -Name "Install Spark" `
        -ClusterRoleCollection HeadNode `
        -Uri https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1

C#:

    // Add the Script Action to install Spark
    clusterInfo.ConfigActions.Add(new ScriptAction(
        "Install Spark",                                    // name of the config action
        new ClusterNodeType[] { ClusterNodeType.HeadNode }, // list of nodes to install Spark on
        new Uri("https://hdiconfigactions.blob.core.windows.net/sparkconfigactionv02/spark-installer-v02.ps1"), // location of the install script
        null                                                // the script takes no parameters
    ));

You can then RDP into the headnode and run the Spark shell, or use spark-submit to run jobs, for example as shown below. I'm not sure how to submit a Spark job without RDP-ing into the headnode, but that is a separate question.
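A typical invocation from the headnode might look like the following; the application JAR, main class, and executor count are placeholders of mine, not anything the script action sets up for you:

    spark-submit --class com.example.MySparkApp --master yarn-client --num-executors 3 mysparkapp.jar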


I asked the Azure folks the same question. Here is the solution they gave:

"Two questions to the topic: 1. How can we send work outside the cluster without the" Remote ... "- Tao Li

This feature is not supported right now. One way to work around it is to build your own job-submission web service (a rough sketch follows the list below):

  • Create a Scala web service that uses the Spark APIs to start jobs on the cluster.
  • Host this web service in a VM inside the same VNet as the cluster.
  • Expose the web service endpoint externally through some authentication scheme. You could also go through an intermediate MapReduce job, though that would take longer.
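As a rough illustration of the first bullet, here is a minimal sketch of such a service in Scala, using only the JDK's built-in HTTP server and shelling out to spark-submit rather than calling the Spark APIs directly. The port, the request format, and the assumption that spark-submit is on the PATH are all mine, and a real service would need the authentication layer described above:

    import com.sun.net.httpserver.{HttpExchange, HttpHandler, HttpServer}
    import java.net.InetSocketAddress
    import scala.sys.process._

    object SparkSubmitService {
      def main(args: Array[String]): Unit = {
        val server = HttpServer.create(new InetSocketAddress(8080), 0)

        server.createContext("/submit", new HttpHandler {
          def handle(exchange: HttpExchange): Unit = {
            // Assumed request format: the body carries "<mainClass> <jarPath>"
            val body = scala.io.Source.fromInputStream(exchange.getRequestBody).mkString
            val Array(mainClass, jarPath) = body.trim.split("\\s+", 2)

            // Shell out to spark-submit on this VM; yarn-client was the usual
            // master setting for Spark 1.x jobs on a YARN cluster
            val exitCode = Seq("spark-submit",
              "--class", mainClass,
              "--master", "yarn-client",
              jarPath).!

            val reply = s"spark-submit exited with code $exitCode\n"
            exchange.sendResponseHeaders(if (exitCode == 0) 200 else 500, reply.length)
            exchange.getResponseBody.write(reply.getBytes)
            exchange.getResponseBody.close()
          }
        })

        server.start()
        println("Submission service listening on port 8080")
      }
    }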

You can use Brisk (https://brisk.elastatools.com), which offers Spark on Azure as a provisioned service (with support). It has a free tier, and it lets you access blob storage using wasb://path/to/files paths, just like HDInsight.
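For example, reading a file from blob storage in the Spark shell might look like this; the container, storage account, and path are placeholders:

    // Count the lines in a text file stored in Azure blob storage via wasb://
    val lines = sc.textFile("wasb://mycontainer@myaccount.blob.core.windows.net/path/to/files/sample.txt")
    println(lines.count())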

It doesn't sit on YARN; instead, it is a lightweight, Azure-focused Spark distribution.

Disclaimer: I work on the project!

Best wishes,

Andy

