Is this architecture possible in Hadoop MR?

Is the following architecture possible in Hadoop MapReduce?

A distributed key-value store (HBase) is used, so each value has a timestamp associated with it. The Map and Reduce tasks are run iteratively. The map in each iteration should take in the values that were added to the store in the previous iteration (perhaps those with the latest timestamp?). The reduce should take in the map output as well as the pairs from the store whose keys match the keys that the reducer has to process in the current iteration. The output of the reduce goes back into the store.

If this is possible, which classes (for example InputFormat, or run() of the Reducer) should be extended so that the operation above is performed instead of the usual one? If it is not possible, are there alternatives that achieve the same result?

2 answers

So, your “store” in iteration n-1 can be as follows:

key (timestamp | value)

a 1 | x, b 2 | x, c 3 | x, d 4 | x

In iteration n these pairs were added: ... b 5 | x, d 6 | x

The mapper would find these 2 entries, since their timestamp is > 4, and put them into the intermediate results.

Now the reducer would find that for these two entries there are two more matching entries in the n-1 store: b 2 | x, d 4 | x

So it would reduce over: b 5 | x, d 6 | x, b 2 | x, d 4 | x

, ?
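To make the walkthrough above concrete, here is a minimal plain-Java sketch of one iteration. It does not use Hadoop or HBase at all; the class `IterationSketch`, the `Cell` type and the method names are made up for illustration, and the in-memory list only stands in for the store. The newest-first ordering inside each group is also my assumption, not something the answer specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class IterationSketch {
    /** One versioned entry of the store: key, timestamp, value (hypothetical type). */
    public static final class Cell {
        public final String key;
        public final long timestamp;
        public final String value;

        public Cell(String key, long timestamp, String value) {
            this.key = key;
            this.timestamp = timestamp;
            this.value = value;
        }

        @Override
        public String toString() {
            return key + " " + timestamp + "|" + value;
        }
    }

    /** The store after iteration n: the four old entries plus the two new ones. */
    public static List<Cell> exampleStore() {
        return Arrays.asList(
                new Cell("a", 1, "x"), new Cell("b", 2, "x"),
                new Cell("c", 3, "x"), new Cell("d", 4, "x"),
                new Cell("b", 5, "x"), new Cell("d", 6, "x"));
    }

    /** "Map" step: keep only entries written after the previous iteration's last timestamp. */
    public static List<Cell> mapNewCells(List<Cell> store, long lastTimestamp) {
        List<Cell> out = new ArrayList<>();
        for (Cell c : store) {
            if (c.timestamp > lastTimestamp) {
                out.add(c);
            }
        }
        return out;
    }

    /** "Reduce" input: for every key the map step emitted, all entries with that key, newest first. */
    public static Map<String, List<Cell>> reduceInput(List<Cell> store, List<Cell> mapped) {
        Map<String, List<Cell>> groups = new TreeMap<>();
        for (Cell m : mapped) {
            groups.putIfAbsent(m.key, new ArrayList<>());
        }
        for (Cell c : store) {
            List<Cell> group = groups.get(c.key);
            if (group != null) {
                group.add(c); // old and new entries for the same key end up together
            }
        }
        for (List<Cell> group : groups.values()) {
            group.sort((x, y) -> Long.compare(y.timestamp, x.timestamp));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Cell> store = exampleStore();
        List<Cell> fresh = mapNewCells(store, 4);
        System.out.println("map output:   " + fresh);
        System.out.println("reduce input: " + reduceInput(store, fresh));
    }
}
```

With the last timestamp of iteration n-1 set to 4, the map step picks up b 5|x and d 6|x, and the reduce input groups them with b 2|x and d 4|x, as in the example.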


, , : IdentityMapper, .

. :

  • HadoopKey = {key | timestamp}
  • HadoopValue = {value}

, , , , . (GroupingComparator)

, , . (KeyComparator)

  • RawComparator,
  • JobConf setOutputValueGroupingComparator() & setOutputKeyComparatorClass()
  • "Hadoop - ", 4, . 100
  • , ; -)
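As a rough illustration of what the key comparator and the grouping comparator from the list above are for, here is a standalone Java simulation of the sort-and-group step. It deliberately avoids the Hadoop classes so it can run on its own: `SecondarySortSketch`, `CompositeKey`, `Row` and `shuffle` are hypothetical names, and sorting timestamps newest-first is my assumption. In a real job the two comparisons would presumably live in RawComparator implementations registered through the JobConf methods mentioned above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SecondarySortSketch {
    /** Composite key {key | timestamp}, mirroring HadoopKey from the list above. */
    public static final class CompositeKey {
        public final String key;
        public final long timestamp;

        public CompositeKey(String key, long timestamp) {
            this.key = key;
            this.timestamp = timestamp;
        }
    }

    /** A key/value pair as it would arrive at the sort step (hypothetical type). */
    public static final class Row {
        public final CompositeKey key;
        public final String value;

        public Row(String key, long timestamp, String value) {
            this.key = new CompositeKey(key, timestamp);
            this.value = value;
        }
    }

    /** Role of the key comparator: order by key, then by timestamp, newest first. */
    public static int compareKeys(CompositeKey a, CompositeKey b) {
        int byKey = a.key.compareTo(b.key);
        return byKey != 0 ? byKey : Long.compare(b.timestamp, a.timestamp);
    }

    /** Role of the grouping comparator: ignore the timestamp, so one reduce call sees all versions. */
    public static int compareGroups(CompositeKey a, CompositeKey b) {
        return a.key.compareTo(b.key);
    }

    /** Simulates the shuffle: sort with compareKeys, then cut into groups with compareGroups. */
    public static List<List<String>> shuffle(List<Row> rows) {
        List<Row> sorted = new ArrayList<>(rows);
        sorted.sort((x, y) -> compareKeys(x.key, y.key));

        List<List<String>> groups = new ArrayList<>();
        CompositeKey previous = null;
        for (Row r : sorted) {
            if (previous == null || compareGroups(previous, r.key) != 0) {
                groups.add(new ArrayList<>()); // a new group corresponds to a new reduce() call
            }
            groups.get(groups.size() - 1).add(r.key.key + " " + r.key.timestamp + "|" + r.value);
            previous = r.key;
        }
        return groups;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
                new Row("b", 5, "x"), new Row("d", 6, "x"),
                new Row("b", 2, "x"), new Row("d", 4, "x"));
        System.out.println(shuffle(rows));
    }
}
```

The point of splitting the two comparisons is that the sort sees the full composite key, while the grouping only looks at the natural key, so each reduce call receives every version of one key in timestamp order.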

- oops, spoiler..., . , . , .

, .


Source: https://habr.com/ru/post/1732739/