This video tutorial by Cloudera gives an excellent description of how to do a large-scale join with MapReduce, starting at around the 12-minute mark.
Here are the basic steps he outlines, in pseudocode, for joining entries from file B to entries from file A on key K. If anything here is unclear, I'd suggest watching the video, because it explains this much better than I can.
In your mapper:
    K from file A:
        tag K to identify it as a Primary Key
        emit <K, value of K>

    K from file B:
        tag K to identify it as a Foreign Key
        emit <K, record>
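If you are on Hadoop's Java API, the tagging step might look roughly like the sketch below. Everything concrete in it is an assumption on my part rather than something from the video: the class name JoinMapper, the idea that K is the first tab-separated field of each line, that records from A can be recognized by their file name, and the "\t0" / "\t1" tag encoding chosen so that PK keys sort ahead of FK keys.

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Sketch of the tagging mapper. Assumes K is the first tab-separated
    // field of each line and that file A's splits have names starting with "A".
    class JoinMapper extends Mapper<LongWritable, Text, Text, Text> {

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            String fileName = ((FileSplit) context.getInputSplit()).getPath().getName();
            String[] fields = line.toString().split("\t", 2);
            String k = fields[0];

            if (fileName.startsWith("A")) {
                // K from file A: tag as Primary Key, emit <K, value of K>.
                context.write(new Text(k + "\t0"), new Text(fields[1]));
            } else {
                // K from file B: tag as Foreign Key, emit <K, record>.
                context.write(new Text(k + "\t1"), line);
            }
        }
    }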
Write a Sorter and Grouper that ignore the PK/FK tag, so that all records with the same key are sent to the same reducer and grouped together, regardless of whether they are PK or FK records.
Write a Comparator that orders PK and FK keys so that the PK record comes first.
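In Hadoop terms these two steps usually mean a custom partitioner, grouping comparator and sort comparator. Here is a minimal sketch that works with the composite "\t0" / "\t1" keys from the mapper sketch above; again, the class names are illustrative, not from the video.

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.io.WritableComparable;
    import org.apache.hadoop.io.WritableComparator;
    import org.apache.hadoop.mapreduce.Partitioner;

    // Partitions on the natural key only, so PK and FK records for the same K
    // end up on the same reducer.
    class NaturalKeyPartitioner extends Partitioner<Text, Text> {
        @Override
        public int getPartition(Text key, Text value, int numPartitions) {
            String naturalKey = key.toString().split("\t", 2)[0];
            return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
        }
    }

    // Groups values by the natural key only, ignoring the PK/FK tag, so the PK
    // record and all FK records for K arrive in a single reduce() call.
    class NaturalKeyGroupingComparator extends WritableComparator {
        public NaturalKeyGroupingComparator() { super(Text.class, true); }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String ka = a.toString().split("\t", 2)[0];
            String kb = b.toString().split("\t", 2)[0];
            return ka.compareTo(kb);
        }
    }

    // Sorts by natural key first, then by tag; "0" (PK) sorts before "1" (FK),
    // so the primary-key value is always the first one the reducer sees.
    class PkFirstSortComparator extends WritableComparator {
        public PkFirstSortComparator() { super(Text.class, true); }

        @Override
        public int compare(WritableComparable a, WritableComparable b) {
            String[] ka = a.toString().split("\t", 2);
            String[] kb = b.toString().split("\t", 2);
            int cmp = ka[0].compareTo(kb[0]);
            return cmp != 0 ? cmp : ka[1].compareTo(kb[1]);
        }
    }

They would be wired into the job driver with job.setPartitionerClass(NaturalKeyPartitioner.class), job.setSortComparatorClass(PkFirstSortComparator.class) and job.setGroupingComparatorClass(NaturalKeyGroupingComparator.class); in a real project each class would be public in its own file.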
The result of these steps is that all records with the same key are sent to the same reducer and appear in the same group of values passed to a single reduce call. The PK-tagged entry comes first, followed by all the entries from B that need to be joined to it. Now, in the reducer:
    value_of_PK = values[0]             // first value is the value of your primary key
    for value in values[1:]:
        value.replace(FK, value_of_PK)  // replace the foreign key with the primary key's value
        emit <key, value>
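Translated into the same Hadoop Java sketch (class name and key encoding are my assumptions), the reducer could look like this. Like the pseudocode, it assumes a PK record exists for every key:

    import java.io.IOException;

    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Sketch of the reducer. Because of the sort comparator, the first value in
    // each group is the primary key's value from file A; every later value is a
    // full record from file B containing the foreign key K.
    class JoinReducer extends Reducer<Text, Text, Text, Text> {

        @Override
        protected void reduce(Text compositeKey, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            String naturalKey = compositeKey.toString().split("\t", 2)[0];
            String pkValue = null;

            for (Text value : values) {
                if (pkValue == null) {
                    // values[0]: the value of the primary key.
                    pkValue = value.toString();
                } else {
                    // values[1:]: records from B; replace the foreign key with
                    // the primary key's value and emit the joined record.
                    String joined = value.toString().replace(naturalKey, pkValue);
                    context.write(new Text(naturalKey), new Text(joined));
                }
            }
        }
    }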
The result will be file B with every occurrence of K replaced by K's value from file A. You can also extend this to a full inner join, or write both files out in full for direct storage in a database, but those are pretty trivial modifications once the basic join works.