Using the standard Scala lib, I can do something like this:
scala> val scalaList = List(1,2,3)
scalaList: List[Int] = List(1, 2, 3)

scala> scalaList.foldLeft(0)((acc,n)=>acc+n)
res0: Int = 6
Create one Int from many Ints.
And I can do something like this:
scala> scalaList.foldLeft("")((acc,n)=>acc+n.toString)
res1: String = 123
Create one String from many Ints.
Thus, foldLeft can be either homogeneous or heterogeneous, depending on what we want, all in one API.
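As far as I can tell, that flexibility comes straight from foldLeft's signature: the accumulator type B is a separate type parameter from the element type A, so they may or may not coincide. A minimal sketch (signature paraphrased from the standard collections; the value names are just my own illustration):

// From the standard collections, on a List[A] (paraphrased):
// def foldLeft[B](z: B)(op: (B, A) => B): B

val homogeneous: Int = List(1, 2, 3).foldLeft(0)(_ + _)        // B = A = Int
val heterogeneous: String = List(1, 2, 3).foldLeft("")(_ + _)  // B = String, A = Int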
In Spark, if I want one Int out of many Ints, I can do this:
scala> val rdd = sc.parallelize(List(1,2,3))
rdd: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:12

scala> rdd.fold(0)((acc,n)=>acc+n)
res1: Int = 6
The fold API is similar to foldLeft, but it is homogeneous only: with fold, an RDD[Int] can only produce an Int.
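The restriction is visible in the signature: unlike foldLeft, there is no second type parameter, so the zero value, the accumulator, and the elements all share the RDD's element type T. A sketch (signature paraphrased from the RDD scaladoc):

// From org.apache.spark.rdd.RDD[T] (paraphrased):
// def fold(zeroValue: T)(op: (T, T) => T): T
//
// Both arguments of op have type T, so the result type
// cannot differ from the element type:
val sum: Int = rdd.fold(0)((acc, n) => acc + n)  // T = Int throughout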
Spark also has an aggregate API:
scala> rdd.aggregate("")((acc,n)=>acc+n.toString, (s1,s2)=>s1+s2)
res11: String = 132
It is heterogeneous: an RDD[Int] can now produce a String.
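aggregate recovers foldLeft's flexibility by asking for two functions instead of one: seqOp folds elements into a per-partition accumulator of type U, and combOp merges those partial accumulators. A sketch (signature paraphrased from the RDD scaladoc; note the "132" above rather than "123", which I assume reflects partition results being merged in a nondeterministic order):

// From org.apache.spark.rdd.RDD[T] (paraphrased):
// def aggregate[U: ClassTag](zeroValue: U)(seqOp: (U, T) => U, combOp: (U, U) => U): U
//
// seqOp plays the role of foldLeft's op within each partition,
// turning T elements into a U accumulator; combOp then merges
// the per-partition U values into one U.
val asString: String = rdd.aggregate("")(
  (acc, n) => acc + n.toString,  // seqOp:  (String, Int) => String
  (s1, s2) => s1 + s2            // combOp: (String, String) => String
)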
So why are fold and aggregate implemented as two different APIs in Spark?
Why is there not a single API designed like foldLeft, which can be either homogeneous or heterogeneous?
(I am very new to Spark, so please excuse me if this is a stupid question.)