Both the Join and CoGroup transformations combine two inputs on key fields. They differ in how the user function is called:
- The Join transformation calls the JoinFunction with pairs of matching records from both inputs that have the same values for the key fields. This behavior is very similar to an equality inner join.
- The CoGroup transformation calls the CoGroupFunction with iterators over all records of both inputs that have the same values for the key fields. If an input has no records for a certain key value, an empty iterator is passed. The CoGroup transformation can be used, among other things, for inner and outer equality joins. It is therefore more generic than the Join transformation.
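The difference in calling convention can be sketched in plain Python. This is only an illustration of the semantics described above, not Flink's API: the helpers `join`, `co_group`, and `left_outer`, and the sample data, are all made up for this example.

```python
from collections import defaultdict


def group_by_key(records):
    """Collect (key, value) records into {key: [values]}."""
    groups = defaultdict(list)
    for key, value in records:
        groups[key].append(value)
    return groups


def join(left, right, join_fn):
    """Call join_fn once per matching pair of records (inner equi-join)."""
    right_groups = group_by_key(right)
    out = []
    for key, lval in left:
        for rval in right_groups.get(key, []):
            out.append(join_fn(key, lval, rval))
    return out


def co_group(left, right, co_group_fn):
    """Call co_group_fn once per key with iterators over both sides.

    A side without records for the key gets an empty iterator.
    """
    lg, rg = group_by_key(left), group_by_key(right)
    out = []
    for key in sorted(set(lg) | set(rg)):
        out.extend(co_group_fn(key, iter(lg.get(key, [])), iter(rg.get(key, []))))
    return out


def left_outer(key, lvals, rvals):
    """A co-group function that implements a left outer join."""
    rvals = list(rvals) or [None]  # pad an empty right side with None
    return [(key, l, r) for l in lvals for r in rvals]


users = [(1, "ann"), (2, "bob"), (3, "cy")]
orders = [(1, "book"), (1, "pen"), (3, "ink")]

inner = join(users, orders, lambda k, u, o: (k, u, o))
outer = co_group(users, orders, left_outer)
```

Here `inner` has no row for key 2, because the join function is never called for a key without a match. `outer` keeps `(2, "bob", None)`, because the co-group function sees the empty iterator and can decide what to emit.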
Looking at the execution strategies of Join and CoGroup, Join can be executed using sort-based and hash-based join strategies, whereas CoGroup is always executed using sort-based strategies. Hence, joins are often more efficient than cogroups and should be preferred if possible.
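To give a rough feel for the two strategy families, here is a toy sketch in plain Python. It is a generic textbook-style illustration, not Flink's internal implementation, and the function names `hash_join` and `sort_merge_co_group` are invented for this example. The hash variant only needs to build a table on one input and probe it with the other; the sort-based variant has to sort both inputs before it can hand complete groups to the user function.

```python
from itertools import groupby
from operator import itemgetter


def hash_join(left, right):
    """Hash-based inner join: build a table on one side, probe with the other."""
    table = {}
    for key, value in left:  # build phase
        table.setdefault(key, []).append(value)
    # probe phase: emit one row per matching pair
    return [(key, l, r) for key, r in right for l in table.get(key, [])]


def sorted_groups(records):
    """Sort records by key and collapse them into [(key, [values])]."""
    ordered = sorted(records, key=itemgetter(0))
    return [(k, [v for _, v in grp]) for k, grp in groupby(ordered, key=itemgetter(0))]


def sort_merge_co_group(left, right):
    """Sort-based co-group: merge two key-sorted group lists."""
    lg, rg = sorted_groups(left), sorted_groups(right)
    i = j = 0
    out = []
    while i < len(lg) or j < len(rg):
        left_done, right_done = i >= len(lg), j >= len(rg)
        if right_done or (not left_done and lg[i][0] < rg[j][0]):
            out.append((lg[i][0], lg[i][1], []))  # key only on the left
            i += 1
        elif left_done or rg[j][0] < lg[i][0]:
            out.append((rg[j][0], [], rg[j][1]))  # key only on the right
            j += 1
        else:
            out.append((lg[i][0], lg[i][1], rg[j][1]))  # key on both sides
            i += 1
            j += 1
    return out


left = [(2, "b"), (1, "a"), (1, "c")]
right = [(1, "x"), (3, "z")]

joined = hash_join(left, right)
co_grouped = sort_merge_co_group(left, right)
```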