groupByKey vs. aggregateByKey - where exactly does the difference come from?

The documentation for groupByKey contains some scary-sounding language, warning that it can be "very expensive" and suggesting that aggregateByKey be used instead whenever possible.

I am wondering whether the cost difference comes from the fact that, for some aggregations, the entire group never needs to be assembled and loaded onto the same node, or whether there are other differences in implementation.

Basically, the question is whether rdd.groupByKey() is equivalent to rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _), or whether it is still more expensive.
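To make the comparison in the question concrete, here is a plain-Scala sketch (no Spark; ordinary collections stand in for an RDD) showing that the two expressions produce the same *result* - so any difference must be in execution cost, not semantics:

```scala
// Semantics-only sketch: plain Scala collections stand in for an RDD.
object Equivalence {
  val pairs = Seq(("a", 1), ("b", 2), ("a", 3))

  // rdd.groupByKey() analogue: collect all values per key
  val grouped: Map[String, Seq[Int]] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2) }

  // rdd.aggregateByKey(Nil)(_ :+ _, _ ++ _) analogue:
  // fold each value into a list (seqOp), concatenating partial lists (combOp)
  val aggregated: Map[String, Seq[Int]] =
    pairs.foldLeft(Map.empty[String, Seq[Int]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.getOrElse(k, Nil) :+ v)
    }
}
```

Both produce `Map(a -> Seq(1, 3), b -> Seq(2))`; the question is purely about what each one costs at runtime.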

+4
2 answers

It is still more expensive - if you reduce to a single item instead of a list.

When you are reducing, aggregateByKey can combine values for the same key inside each partition before anything is shuffled; the relationship is much like reduce vs aggregate at the whole-RDD level.

In other words, aggregateByKey performs map-side combining before the shuffle. This greatly reduces the amount of data written to disk and sent over the network.

groupByKey, by contrast, must deliver every value to the node that owns its key, and map-side combining is disabled for it, because building up lists on the map side would not shrink the shuffled data at all. If you then reduce the output of groupByKey, you have already paid for shuffling all of the raw values.

See also: the SO answer by zero323.

Also, in some use cases aggregateByKey/combineByKey is cheaper than groupByKey even when you keep more than one value per key. Example: keeping only the top 10 values for each key. Using combineByKey with proper merge and combiner functions beats using groupByKey and then taking 10.
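That top-N case can be sketched in plain Scala (no Spark; N = 3 instead of 10 to keep it small). A combineByKey-style bounded accumulator never holds more than N values per key at any step, whereas groupByKey would materialise every value before taking the top 10:

```scala
// Hypothetical combineByKey-style "top N per key" with a bounded buffer.
object TopN {
  val n = 3 // the question's example uses 10; 3 keeps this sketch small

  def createCombiner(v: Int): List[Int] = List(v)
  def mergeValue(acc: List[Int], v: Int): List[Int] =
    (v :: acc).sorted(Ordering[Int].reverse).take(n)   // cap at n values
  def mergeCombiners(a: List[Int], b: List[Int]): List[Int] =
    (a ++ b).sorted(Ordering[Int].reverse).take(n)     // still capped at n

  val pairs = Seq(("a", 5), ("a", 9), ("a", 1), ("a", 7), ("b", 2))

  // Fold values key by key, as combineByKey would within a partition
  val topPerKey: Map[String, List[Int]] =
    pairs.foldLeft(Map.empty[String, List[Int]]) { case (acc, (k, v)) =>
      acc.updated(k, acc.get(k).map(mergeValue(_, v)).getOrElse(createCombiner(v)))
    }
}
```

Because every intermediate accumulator is capped at N elements, the shuffled data per key is bounded no matter how skewed the key distribution is.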

+4

First, consider groupByKey.

It does exactly what its name suggests: it collects all the values for each key into one place.

Its signature is:

def groupByKey(): RDD[(K, Iterable[V])]

As the "group by" in the name implies, every value for a key ends up in a single Iterable on a single partition, so all of the raw values have to be shuffled across the network. That is what makes it expensive.

aggregateByKey works differently. Its signature is:

def aggregateByKey[U](zeroValue: U)(seqOp: (U, V) ⇒ U, combOp: (U, U) ⇒ U)(implicit arg0: ClassTag[U]): RDD[(K, U)]

It takes three pieces:

zeroValue is the initial accumulator for each key; seqOp folds each value into the accumulator within a partition; combOp merges the per-partition accumulators together.

Because partial results are computed inside each partition before the shuffle, far less data crosses the network, and the accumulator type U does not have to match the value type V.
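A worked example of the zeroValue / seqOp / combOp trio, again as a plain-Scala sketch with an assumed two-partition layout: a per-key average using a (sum, count) accumulator, which needs U != V and so could not be written with reduceByKey alone.

```scala
// aggregateByKey semantics sketch: per-key average with a (sum, count) pair.
object AggExample {
  type Acc = (Int, Int) // (running sum, running count)

  val zeroValue: Acc = (0, 0)
  def seqOp(acc: Acc, v: Int): Acc = (acc._1 + v, acc._2 + 1)  // within a partition
  def combOp(a: Acc, b: Acc): Acc  = (a._1 + b._1, a._2 + b._2) // across partitions

  val partitions = Seq(Seq(("a", 2), ("a", 4)), Seq(("a", 6), ("b", 3)))

  // Step 1: fold values into accumulators inside each partition (map side)
  val perPartition: Seq[Map[String, Acc]] =
    partitions.map(_.foldLeft(Map.empty[String, Acc]) { case (m, (k, v)) =>
      m.updated(k, seqOp(m.getOrElse(k, zeroValue), v))
    })

  // Step 2: merge the per-partition accumulators (after the shuffle)
  val merged: Map[String, Acc] =
    perPartition.flatten.groupBy(_._1).map { case (k, kvs) =>
      k -> kvs.map(_._2).reduce(combOp)
    }

  val averages: Map[String, Double] =
    merged.map { case (k, (sum, count)) => k -> sum.toDouble / count }
}
```

Note that only the small (sum, count) pairs cross the partition boundary in step 2, never the raw values - which is exactly the saving groupByKey cannot make.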

-1

Source: https://habr.com/ru/post/1686031/
