Why are aggregate and fold two different APIs in Spark?

Using the standard Scala lib, I can do something like this:

  scala> val scalaList = List(1,2,3)
  scalaList: List[Int] = List(1, 2, 3)

  scala> scalaList.foldLeft(0)((acc,n)=>acc+n)
  res0: Int = 6

Create one Int from many Ints.

And I can do something like this:

  scala> scalaList.foldLeft("")((acc,n)=>acc+n.toString)
  res1: String = 123

Create one String from many Ints.

Thus, foldLeft can be either homogeneous or heterogeneous, depending on what we want, in one API.

In Spark, if I want one Int out of many Ints, I can do this:

  scala> val rdd = sc.parallelize(List(1,2,3))
  rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12

  scala> rdd.fold(0)((acc,n)=>acc+n)
  res1: Int = 6

The fold API is similar to foldLeft, but it is homogeneous only: an RDD[Int] can only produce an Int with fold.

Spark also has an aggregate API:

  scala> rdd.aggregate("")((acc,n)=>acc+n.toString, (s1,s2)=>s1+s2)
  res11: String = 132

It is heterogeneous: an RDD[Int] can now produce a String.

So why are fold and aggregate implemented as two different APIs in Spark?

Why are they not designed like foldLeft, which can be either homogeneous or heterogeneous?

(I am very new to Spark, so please excuse me if this is a stupid question.)

+6
3 answers

fold can be implemented more efficiently because it does not depend on a fixed evaluation order. So each cluster node can fold its own chunk in parallel, followed by one small final fold at the end. With foldLeft, on the other hand, every element has to be folded in order, and nothing can be done in parallel.
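A rough sketch of the idea in plain Scala (not Spark; the grouping into chunks below just stands in for partitions on different nodes): because fold's operation stays within one type and is meant to be associative, each chunk can be folded on its own and the partial results folded once more at the end.

  // Pretend these chunks are partitions living on different cluster nodes.
  val data = (1 to 12).toList
  val partitions: List[List[Int]] = data.grouped(4).toList

  // Each "node" folds its own chunk independently (this step could run in parallel).
  val partials: List[Int] = partitions.map(_.fold(0)(_ + _))

  // One small final fold over the partial results.
  val total: Int = partials.fold(0)(_ + _)  // 78, same as data.foldLeft(0)(_ + _)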

(It's also nice to have a simpler API for the common case, for convenience. The standard lib has reduce as well as foldLeft for the same reason.)
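For example, in plain Scala:

  List(1, 2, 3).reduce(_ + _)       // 6: the simple API for the homogeneous case
  List(1, 2, 3).foldLeft(0)(_ + _)  // 6: the general API, the accumulator type is free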

+1

In particular, in Spark the computation is distributed and performed in parallel, so foldLeft cannot be implemented as it is in the standard library. Instead, aggregate requires two functions: one that performs an operation similar to fold on each element of type T, producing a value of type U, and another that combines the U from each partition into the final value:

 def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U 
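As a concrete illustration of the two functions (a sketch for spark-shell, assuming a SparkContext named sc as in the question): computing a sum and a count in a single pass, something a homogeneous fold over Int could not express directly.

  // seqOp folds elements of type Int into an accumulator of type (Int, Int);
  // combOp merges the per-partition accumulators.
  val nums = sc.parallelize(List(1, 2, 3, 4))
  val (sum, count) = nums.aggregate((0, 0))(
    (acc, n) => (acc._1 + n, acc._2 + 1),    // seqOp:  (U, T) => U
    (a, b)   => (a._1 + b._1, a._2 + b._2)   // combOp: (U, U) => U
  )
  val avg = sum.toDouble / count  // 2.5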
+2

foldLeft, foldRight, reduceLeft, reduceRight, scanLeft and scanRight are operations where the accumulated parameter can differ from the input parameters ((A, B) -> B), and these operations can only be executed sequentially.

fold is an operation where the accumulated parameter has to be of the same type as the input parameters ((A, A) -> A). Then it can be executed in parallel.

aggregate is an operation where the accumulated parameter can be of a different type than the input parameters, but then you have to provide an additional function that defines how the accumulated results are combined into the final result. This operation allows parallel execution. The aggregate operation is a combination of foldLeft and fold.
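One way to picture "aggregate = a foldLeft inside each partition plus a fold across partitions" is this plain-Scala sketch (my own illustration, not Spark's actual implementation):

  // Simulate aggregate on plain collections: seqOp runs as a foldLeft
  // inside each partition, combOp runs as a fold over the partial results.
  def aggregateSketch[T, U](partitions: List[List[T]])(zero: U)(
      seqOp: (U, T) => U, combOp: (U, U) => U): U =
    partitions
      .map(_.foldLeft(zero)(seqOp)) // sequential within a partition
      .fold(zero)(combOp)           // homogeneous merge across partitions

  val parts = List(List(1, 2), List(3))
  aggregateSketch(parts)("")((acc, n) => acc + n.toString, _ + _)
  // "123" here; in Spark the merge order is not guaranteed, hence the "132" above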

For more information, you can watch the Coursera videos for the Parallel Programming course.

0

Source: https://habr.com/ru/post/977420/

