Hadoop map-side join vs. hash join?

I am trying to implement a hash join in Hadoop.

However, Hadoop already seems to provide a map-side join and a reduce-side join.

What is the difference between these techniques and a hash join?

+3
2 answers

Map-side join

In a map-side (fragment-replicate) join, you hold one data set in memory (for example, in a hash table) and join it against the other data set record by record. In Pig, you would write

edges_from_list = JOIN a_follows_b BY user_a_id, some_list BY user_id using 'replicated';

taking care that the smaller data set is on the right. This is extremely efficient, since there is no network overhead and CPU demand is minimal.
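To make the link to a hash join concrete, here is a minimal in-memory sketch of the same idea in Python: build a hash table from the small data set, then stream the large one past it. It does not use the Hadoop or Pig APIs. The function name `replicated_join` and the sample records are hypothetical, chosen only to mirror the relation names in the Pig statement above.

```python
from collections import defaultdict


def replicated_join(large_records, small_records, large_key, small_key):
    """Join each large-side record against a hash table of the small side."""
    # Build phase: hash the small data set on its join key (must fit in memory).
    table = defaultdict(list)
    for rec in small_records:
        table[rec[small_key]].append(rec)
    # Probe phase: stream the large data set one record at a time.
    for rec in large_records:
        for match in table.get(rec[large_key], ()):
            yield {**rec, **match}


# Hypothetical sample data, for illustration only.
a_follows_b = [
    {"user_a_id": 1, "user_b_id": 2},
    {"user_a_id": 3, "user_b_id": 1},
    {"user_a_id": 4, "user_b_id": 2},
]
some_list = [{"user_id": 1}, {"user_id": 4}]

edges_from_list = list(
    replicated_join(a_follows_b, some_list, "user_a_id", "user_id")
)
```

In a real job each mapper would hold its own copy of the small table and probe it with its split of the large input, so no records need to be shuffled for the join itself.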

Reduce-side join

In a reduce-side join, the records from both data sets are grouped on the join key:

<user_id   {A, B, F, ..., Z},  { A, C, G, ..., Q} >

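and a record is emitted for every pair of an element from the first set with an element from the second set: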

[A   user_id    A]
[A   user_id    C]
...
[A   user_id    Q]
...
[Z   user_id    Q]
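The grouped record and the emitted pairs shown above can be imitated in plain Python. This is only a sketch under stated assumptions: the dictionary stands in for Hadoop's shuffle, the function name `reduce_side_join` is hypothetical, and the inputs are tiny made-up lists of (key, value) pairs.

```python
from collections import defaultdict
from itertools import product


def reduce_side_join(left, right):
    """Join two iterables of (key, value) pairs on their keys."""
    # "Shuffle": collect the values of both inputs per join key,
    # keeping the two sides in separate lists.
    groups = defaultdict(lambda: ([], []))
    for key, value in left:
        groups[key][0].append(value)
    for key, value in right:
        groups[key][1].append(value)
    # "Reduce": for each key, emit every pairing of a left value
    # with a right value.
    for key in sorted(groups):
        left_vals, right_vals = groups[key]
        for l_val, r_val in product(left_vals, right_vals):
            yield (l_val, key, r_val)


# Hypothetical inputs, for illustration only.
left = [("user_id", "A"), ("user_id", "B")]
right = [("user_id", "A"), ("user_id", "C")]
joined = list(reduce_side_join(left, right))
```

Unlike the map-side variant, nothing here requires a whole data set to be held per task, only the values that share one key.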

, - . Pig , . ( -, ).

, . ( , ). , ; , , .

, , . , , , , () .

- . Zebra () .

+7

Hadoop , () . , - , -. " " " " MapReduce Jimmy Lin Chris Dyer, .

0

Source: https://habr.com/ru/post/1745193/

