What is the difference between Apache Mahout and Apache Spark MLlib?

Consider a MySQL products database with 10 million products for an e-commerce website.

I am trying to create a classification module for categorizing products. I am using Apache Sqoop to import data from MySQL into Hadoop.

I wanted to use Mahout on top of it as the machine learning framework and apply one of its classification algorithms, and then I came across Spark, which ships with MLlib.

  • What is the difference between the two frameworks?
  • Basically, what are the advantages, disadvantages and limitations of each?
+47
mahout apache-spark apache-spark-mllib
May 7 '14 at 7:30
2 answers

The main difference comes from the underlying frameworks: Hadoop MapReduce in the case of Mahout, and Spark in the case of MLlib. More specifically, it comes from the difference in per-job overhead.
If your ML algorithm maps to a single MR job, the only real difference is the startup overhead, which is tens of seconds for Hadoop MR and, let's say, 1 second for Spark. For model training, that is not very important.
Things are different if your algorithm maps to many jobs. In that case the same overhead difference applies to every iteration, and it can be a game changer.
Suppose we need 100 iterations, each of which takes 5 seconds of cluster CPU time.

  • On Spark: It takes 100 * 5 + 100 * 1 seconds = 600 seconds.
  • On Hadoop MR (Mahout): It takes 100 * 5 + 100 * 30 seconds = 3500 seconds.
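
The arithmetic above can be checked with a short Python sketch. It assumes that every iteration is launched as a separate job and pays the full launch overhead; the 1-second and 30-second overheads are the rough figures from this answer, not measurements, and the function name is made up for illustration.

```python
def total_runtime_seconds(iterations, compute_per_iteration, overhead_per_job):
    """Total wall time when every iteration runs as its own job."""
    return iterations * (compute_per_iteration + overhead_per_job)


ITERATIONS = 100
COMPUTE_SECONDS = 5      # cluster CPU time per iteration (from the example)
SPARK_OVERHEAD = 1       # assumed per-job launch overhead on Spark
HADOOP_MR_OVERHEAD = 30  # assumed per-job launch overhead on Hadoop MR

spark_total = total_runtime_seconds(ITERATIONS, COMPUTE_SECONDS, SPARK_OVERHEAD)
hadoop_total = total_runtime_seconds(ITERATIONS, COMPUTE_SECONDS, HADOOP_MR_OVERHEAD)

print(spark_total, hadoop_total)  # 600 3500
```

With a single job (iterations = 1) the same function gives 6 versus 35 seconds, which is why the overhead hardly matters for one-pass training.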

At the same time, Hadoop MR is a much more mature framework than Spark, and if you have a lot of data and stability is paramount, I would consider Mahout a serious alternative.

+38
May 7 '14 at 10:45

Warning: major edit.

MLlib is a loose collection of high-level algorithms that runs on Spark. This is what Mahout used to be, except that the Mahout of old ran on Hadoop MapReduce. In 2014, Mahout announced that it would no longer accept Hadoop MapReduce code and switched all new development to Spark (with other engines, such as H2O, possibly in the offing).

The most significant thing to come out of this is a Scala-based, generalized, distributed, optimized linear algebra engine and environment, including an interactive Scala shell. Perhaps the most important word is "generalized." Since it runs on Spark, anything available in MLlib can be used together with the Mahout-Spark linear algebra engine.

If you need a general engine that does much of what tools like R do, but on really big data, look at Mahout. If you need a specific algorithm, look at each project to see what it offers. For example, KMeans runs in MLlib, but if you need to cluster A'A (the cooccurrence matrix used in recommenders), you will need both, because MLlib does not have a matrix transpose or A'A (in fact, Mahout computes an optimized A'A, so the explicit transpose is optimized out).
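
To illustrate what "A'A without a transpose" means, here is a minimal single-machine Python sketch. It is not Mahout's implementation or API; it only relies on the identity that A'A equals the sum, over the rows of A, of each row's outer product with itself, so the transpose never has to be materialized. The function name and the small interaction matrix are made up for illustration.

```python
def gram_matrix(rows, n_cols):
    """Compute A'A by accumulating the outer product of each row with itself."""
    result = [[0] * n_cols for _ in range(n_cols)]
    for row in rows:
        # Only non-zero entries contribute, which matters for sparse data.
        nonzero = [(j, v) for j, v in enumerate(row) if v != 0]
        for j, vj in nonzero:
            for k, vk in nonzero:
                result[j][k] += vj * vk
    return result


# Hypothetical data: 3 users (rows) x 3 products (columns), 1 = interaction.
A = [
    [1, 1, 0],
    [0, 1, 1],
    [1, 1, 1],
]

cooccurrence = gram_matrix(A, 3)
for line in cooccurrence:
    print(line)
# [2, 2, 1]
# [2, 3, 2]
# [1, 2, 2]
```

Each row of A can be processed independently and the partial results summed, which is what makes this formulation a natural fit for a distributed engine.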

Mahout also includes several innovative recommender building blocks that offer things found in no other OSS.

Mahout still has its older Hadoop algorithms, but as fast compute engines like Spark become the norm, that is where most people will invest.

+38
May 8 '14 at 6:08