What type of cluster should I choose for Spark?

I'm new to Apache Spark, and I just found out that Spark supports three types of clusters:

  • Standalone - Spark manages its own cluster
  • YARN - using the Hadoop YARN resource manager
  • Mesos - Apache's dedicated resource management project

Since I'm new to Spark, I think I should try Standalone first. But I wonder which one is recommended. Say, in the future I need to build a large cluster (hundreds of instances), which cluster type should I go with?

+46
yarn apache-spark mesos apache-spark-standalone
Feb 22 '15 at 23:44
3 answers

I think the best answer comes from the people who work on Spark. So, from Learning Spark:

Start with a standalone cluster if this is a new deployment. Standalone mode is the easiest to set up and will provide almost all the same features as the other cluster managers if you are only running Spark.

If you would like to run Spark alongside other applications, or to use richer resource scheduling capabilities (for example, queues), both YARN and Mesos provide these features. Of these, YARN will likely be preinstalled in many Hadoop distributions.

One advantage of Mesos over both YARN and standalone mode is its fine-grained sharing option, which lets interactive applications such as the Spark shell scale down their CPU allocation between commands. This makes it attractive in environments where multiple users run interactive shells.

In all cases, it is best to run Spark on the same nodes as HDFS for fast access to storage. You can install Mesos or the standalone cluster manager on the same nodes manually, or most Hadoop distributions already install YARN and HDFS together.
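For concreteness, here is a minimal sketch of how the choice of cluster manager shows up in application code: only the master URL changes. The host names are placeholders, and in practice the master is usually passed via spark-submit rather than hard-coded.

```scala
import org.apache.spark.sql.SparkSession

// The master URL is the only thing that changes between cluster managers.
// Host names below are placeholders.
val spark = SparkSession.builder()
  .appName("cluster-manager-demo")
  .master("spark://master-host:7077")   // standalone
  // .master("yarn")                    // YARN (reads HADOOP_CONF_DIR)
  // .master("mesos://mesos-host:5050") // Mesos
  .getOrCreate()

println(spark.sparkContext.master)      // confirm which master is in use
spark.stop()
```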

+48
Feb 23 '15 at 2:09

Spark Standalone Manager : A simple cluster manager included with Spark that makes it easy to set up a cluster. By default, each application uses all the available nodes in the cluster.
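Because a standalone application grabs all available nodes by default, a common first step is to cap its footprint. A minimal sketch, assuming Spark's documented `spark.cores.max` setting and a placeholder master host:

```scala
import org.apache.spark.sql.SparkSession

// Cap this application's share of a standalone cluster so it does not
// claim every available node by default. "master-host" is a placeholder.
val spark = SparkSession.builder()
  .appName("capped-standalone-app")
  .master("spark://master-host:7077")
  .config("spark.cores.max", "8")        // at most 8 cores across the cluster
  .config("spark.executor.memory", "2g") // memory per executor
  .getOrCreate()
```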

Several advantages of YARN over standalone and Mesos:

  • YARN allows you to dynamically share and centrally configure the same cluster resource pool between all the frameworks that run on YARN.

  • You can use all the features of YARN schedulers to categorize, isolate, and prioritize workloads.

  • In standalone mode, Spark requires each application to run an executor on every node in the cluster, whereas with YARN you choose the number of executors to use (see the sketch after this list).

  • YARN directly handles rack and machine locality in your requests, which is convenient.

  • Strange as it may seem, the resource request model is reversed in Mesos. In YARN, you (the framework) request containers with a given specification and give locality preferences. In Mesos, you receive resource "offers" and choose to accept or reject them based on your own scheduling policy. The Mesos model is arguably more flexible, but it means more work for the person implementing the framework.

  • If you already have a large Hadoop cluster, YARN is the best choice.

  • The standalone manager requires the user to configure each of the nodes with a shared secret. Mesos' default authentication module, Cyrus SASL, can be replaced with a custom module. YARN has security features for authentication, service-level authorization, authentication for web consoles, and data confidentiality. Hadoop authentication uses Kerberos to verify that each user and service is authenticated by Kerberos.

  • All three cluster managers provide high availability, but Hadoop YARN does not need to run a separate ZooKeeper failover controller.
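To illustrate the executor-count point from the list above, here is a hedged sketch of sizing a Spark-on-YARN application explicitly. `spark.executor.instances`, `spark.executor.cores`, and `spark.executor.memory` are standard Spark settings; the numbers are placeholders.

```scala
import org.apache.spark.sql.SparkSession

// On YARN you request a specific number of executors rather than running
// one executor per cluster node. The figures below are illustrative only.
val spark = SparkSession.builder()
  .appName("yarn-sized-app")
  .master("yarn")
  .config("spark.executor.instances", "10") // executors to request from YARN
  .config("spark.executor.cores", "4")      // cores per executor
  .config("spark.executor.memory", "4g")    // heap per executor
  .getOrCreate()
```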

Useful links:

spark

agildata article

+45
Jan 07 '16 at 14:32

Standalone is pretty clear: as others mentioned, it should be used only when you have a Spark-only workload.

Between YARN and Mesos, one thing to keep in mind is that, unlike MapReduce, a Spark job grabs executors at the start and holds them for the entire lifetime of the job, whereas a MapReduce job can acquire and release mappers and reducers throughout its lifetime.

If you have long-running Spark jobs that do not fully use all the resources they were given at the start, you may want to share those resources with other applications, and you can do that only through Mesos or through Spark's dynamic allocation: https://spark.apache.org/docs/2.0.2/job-scheduling.html#scheduling-across-applications So with YARN, the only way to get dynamic allocation for Spark is to use the dynamic allocation feature that Spark itself provides; YARN will not interfere with it, while Mesos will manage such sharing itself. Again, this whole point matters only if you have a long-running Spark application that you would like to scale up and down dynamically.
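As a concrete illustration, a minimal sketch of enabling Spark's dynamic allocation. The settings are the documented `spark.dynamicAllocation.*` options from the job-scheduling page linked above; the executor bounds are placeholders, and on YARN the external shuffle service must also be running on each node.

```scala
import org.apache.spark.sql.SparkSession

// Let a long-running application release idle executors and request more
// under load, instead of holding its initial allocation for its lifetime.
val spark = SparkSession.builder()
  .appName("elastic-app")
  .master("yarn")
  .config("spark.dynamicAllocation.enabled", "true")
  .config("spark.shuffle.service.enabled", "true") // required by dynamic allocation
  .config("spark.dynamicAllocation.minExecutors", "2")
  .config("spark.dynamicAllocation.maxExecutors", "50")
  .getOrCreate()
```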

+5
Dec 08 '16 at 17:13


