groupByKey vs. aggregateByKey - where exactly does the difference come from?

The documentation for groupByKey contains some scary-sounding language, warning that it can be "very expensive" and suggesting that aggregateByKey be used instead whenever possible.

I am wondering whether the cost difference comes from the fact that, for some aggregations, the entire group never needs to be assembled and loaded onto the same node, or whether there are other differences in implementation.

Basically, the question is whether rdd.groupByKey() is equivalent to rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _), or whether it is still more expensive.
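To make the comparison in the question concrete, here is a plain-Scala sketch (no Spark; ordinary collections stand in for an RDD) showing that the two expressions produce the same *result* - so any difference must be in execution cost, not semantics:

```scala
// Semantics-only sketch: plain Scala collections stand in for an RDD.
object Equivalence {
  val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

  // rdd.groupByKey() analogue: collect all values per key
  val grouped: Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _) analogue:
  // fold each value into a list (seqOp), concatenating partial lists (combOp)
  val aggregated: Map[String, Seq[Int]] =
    pairs.foldLeft(Map.empty[String, Seq[Int]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, Nil) :+ v)
    }
}
```

Both produce `Map(a -> Seq(1, 3), b -> Seq(2))`; the question is purely about what each one costs at runtime.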

+4
2 answers

It is still more expensive - if you reduce to a single item instead of a list.

When you are reducing, aggregateByKey can combine values for the same key inside each partition before anything is shuffled; the relationship is much like reduce vs aggregate at the whole-RDD level.

In other words, aggregateByKey performs map-side combining before the shuffle. This greatly reduces the amount of data written to disk and sent over the network.

groupByKey, by contrast, must deliver every value to the node that owns its key, and map-side combining is disabled for it, because building up lists on the map side would not shrink the shuffled data at all. If you then reduce the output of groupByKey, you have already paid for shuffling all of the raw values.

See also: the SO answer by zero323.

Also, in some use cases aggregateByKey/combineByKey is cheaper than groupByKey even when you keep more than one value per key. Example: keeping only the top 10 values for each key. Using combineByKey with proper merge and combiner functions beats using groupByKey and then taking 10.
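That top-N case can be sketched in plain Scala (no Spark; N = 3 instead of 10 to keep it small). A combineByKey-style bounded accumulator never holds more than N values per key at any step, whereas groupByKey would materialise every value before taking the top 10:

```scala
// Hypothetical combineByKey-style "top N per key" with a bounded buffer.
object TopN {
  val n = 3 // the question's example uses 10; 3 keeps this sketch small

  def createCombiner(v: Int): List[Int] = List(v)
  def mergeValue(acc: List[Int], v: Int): List[Int] =
    (v :: acc).sorted(Ordering[Int].reverse).take(n)   // cap at n values
  def mergeCombiners(a: List[Int], b: List[Int]): List[Int] =
    (a ++ b).sorted(Ordering[Int].reverse).take(n)     // still capped at n

  val pairs = Seq(("a", 5), ("a", 9), ("a", 1), ("a", 7), ("b", 2))

  // Fold values key by key, as combineByKey would within a partition
  val topPerKey: Map[String, List[Int]] =
    pairs.foldLeft(Map.empty[String, List[Int]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }
}
```

Because every intermediate accumulator is capped at N elements, the shuffled data per key is bounded no matter how skewed the key distribution is.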

+4

First, consider groupByKey.

It does exactly what its name suggests: it collects all the values for each key into one place.

Its signature is:

def groupByKey(): RDD[(K, Iterable[V])]

As the "group by" in the name implies, every value for a key ends up in a single Iterable on a single partition, so all of the raw values have to be shuffled across the network. That is what makes it expensive.

aggregateByKey works differently. Its signature is:

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

It takes three pieces:

zeroValue is the initial accumulator for each key; seqOp folds each value into the accumulator within a partition; combOp merges the per-partition accumulators together.

Because partial results are computed inside each partition before the shuffle, far less data crosses the network, and the accumulator type U does not have to match the value type V.
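A worked example of the zeroValue / seqOp / combOp trio, again as a plain-Scala sketch with an assumed two-partition layout: a per-key average using a (sum, count) accumulator, which needs U != V and so could not be written with reduceByKey alone.

```scala
// aggregateByKey semantics sketch: per-key average with a (sum, count) pair.
object AggExample {
  type Acc = (Int, Int) // (running sum, running count)

  val zeroValue: Acc = (0, 0)
  def seqOp(acc: Acc, v: Int): Acc = (acc._1 + v, acc._2 + 1)  // within a partition
  def combOp(a: Acc, b: Acc): Acc  = (a._1 + b._1, a._2 + b._2) // across partitions

  val partitions = Seq(Seq(("a", 2), ("a", 4)), Seq(("a", 6), ("b", 3)))

  // Step 1: fold values into accumulators inside each partition (map side)
  val perPartition: Seq[Map[String, Acc]] =
    partitions.map(_.foldLeft(Map.empty[String, Acc]) { case (m, (k, v)) =>
      m.updated(k, seqOp(m.getOrElse(k, zeroValue), v))
    })

  // Step 2: merge the per-partition accumulators (after the shuffle)
  val merged: Map[String, Acc] =
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(combOp)
    }

  val averages: Map[String, Double] =
    merged.map { case (k, (sum, count)) => k -> sum.toDouble / count }
}
```

Note that only the small (sum, count) pairs cross the partition boundary in step 2, never the raw values - which is exactly the saving groupByKey cannot make.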

-1

Source: https://habr.com/ru/post/1686031/
