Mapping two datasets using Hadoop

Suppose I have two data sets keyed on a common key, call them data sets A and B. I want to update all the data in Set A with the data from Set B, joining the two on the key.

Since I am dealing with such large amounts of data, I use Hadoop for MapReduce. My concern is that in order to perform this key match between A and B, I need to load all of Set A (a large amount of data) into the memory of every mapper instance. That seems rather inefficient.

Is there a recommended way to do this that does not require loading all of Set A every time?

Some pseudo-code to clarify what I'm doing now:

Load in Data Set A   # This seems like the expensive step to always be doing
Foreach key/value in Data Set B:
    If key is in Data Set A:
        Update Data Set A
+4
3 answers

According to the documentation, the MapReduce framework includes the following steps:

  • Map
  • Sort / Partition
  • Merge (optional)
  • Reduce

You have described one way to do your join: loading all of Set A into memory in each Mapper. You are right that this is inefficient.

Instead, note that a large join can be split into arbitrarily many smaller joins if both sets are sorted and partitioned by key. MapReduce sorts the output of each Mapper by key in step (2) above. The sorted map output is then partitioned by key, so that one partition is created per Reducer. For each unique key, the Reducer will receive all values from both Set A and Set B.

To complete the join, the Reducer only needs to emit the key and the updated value from Set B if it exists; otherwise, it emits the key and the original value from Set A. To distinguish the values from Set A and Set B, set a flag on the value output by the Mapper.
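
A minimal sketch of what that Mapper/Reducer pair might look like in Java, assuming both sets are tab-separated text files with the key in the first field; the class names and the "A"/"B" flag encoding are illustrative assumptions, not something taken from the question:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ReduceSideUpdate {

    // Tags each record from Set A with an "A" flag.
    public static class SetAMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            ctx.write(new Text(fields[0]), new Text("A\t" + fields[1]));
        }
    }

    // Tags each record from Set B with a "B" flag.
    public static class SetBMapper extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            String[] fields = line.toString().split("\t", 2);
            ctx.write(new Text(fields[0]), new Text("B\t" + fields[1]));
        }
    }

    // For each key, prefer the value from Set B; otherwise keep Set A's value.
    public static class UpdateReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text key, Iterable<Text> values, Context ctx)
                throws IOException, InterruptedException {
            String fromA = null;
            String fromB = null;
            for (Text tagged : values) {
                String[] parts = tagged.toString().split("\t", 2);
                if ("A".equals(parts[0])) {
                    fromA = parts[1];
                } else {
                    fromB = parts[1];
                }
            }
            if (fromB != null) {
                ctx.write(key, new Text(fromB));      // updated value from Set B
            } else if (fromA != null) {
                ctx.write(key, new Text(fromA));      // unchanged value from Set A
            }
        }
    }
}

The two Mappers would be attached to their respective input paths with MultipleInputs.addInputPath when setting up the Job.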

+3

All the answers posted so far are correct - this should be a reduce-side join... but there is no need to reinvent the wheel! Have you considered Pig, Hive, or Cascading for this? They all have joins built in and are well optimized.

+3

This video tutorial from Cloudera gives an excellent description of how to do a large-scale join with MapReduce, starting at around the 12-minute mark.
Here are the basic steps he outlines for joining records from file B onto records from file A on key K, with pseudo-code. If anything here is unclear, I would suggest watching the video, because it does a much better job of explaining this than I can.

In your mapper:

 K from file A:
     tag K to identify as Primary Key
     emit <K, value of K>
 K from file B:
     tag K to identify as Foreign Key
     emit <K, record>
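
A hypothetical Java version of that mapper, assuming tab-separated input, a composite output key of the form <K>#<tag> (tag 0 for the primary-key record from file A, tag 1 for foreign-key records from file B), and that the two files can be told apart by their file names; all of these are assumptions made for the sketch:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TaggingMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
            throws IOException, InterruptedException {
        // Assumption: records look like "key<TAB>rest-of-record" and the input
        // file name tells us whether this record comes from file A or file B.
        String fileName = ((FileSplit) ctx.getInputSplit()).getPath().getName();
        String tag = fileName.startsWith("A") ? "0" : "1";   // 0 = PK, 1 = FK
        String[] fields = line.toString().split("\t", 2);
        ctx.write(new Text(fields[0] + "#" + tag), new Text(fields[1]));
    }
}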

Write a Partitioner and Grouper that ignore the PK/FK tags, so that your records are sent to the same Reducer and grouped together regardless of whether they are PK or FK records.
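
In Java MapReduce terms, that is a custom Partitioner plus a grouping comparator. A sketch, assuming the <K>#<tag> composite key from the mapper sketch above (class names are made up for illustration):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition on the natural key only, ignoring the #tag suffix.
public class NaturalKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numPartitions) {
        String naturalKey = key.toString().split("#", 2)[0];
        return (naturalKey.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Group on the natural key only, so PK and FK records share one reduce() call.
class NaturalKeyGroupingComparator extends WritableComparator {
    protected NaturalKeyGroupingComparator() {
        super(Text.class, true);   // instantiate keys so compare() can read them
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String keyA = a.toString().split("#", 2)[0];
        String keyB = b.toString().split("#", 2)[0];
        return keyA.compareTo(keyB);
    }
}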

Write a Comparator that compares the PK and FK records and sorts the PK record first.
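
A sketch of that sort comparator under the same assumptions, ordering first by the natural key and then by the tag, so the PK record (tag 0) arrives before the FK records (tag 1):

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;

public class PkFirstSortComparator extends WritableComparator {
    protected PkFirstSortComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String[] partsA = a.toString().split("#", 2);   // [naturalKey, tag]
        String[] partsB = b.toString().split("#", 2);
        int byKey = partsA[0].compareTo(partsB[0]);
        return byKey != 0 ? byKey : partsA[1].compareTo(partsB[1]);   // "0" < "1"
    }
}

These three classes would then be registered on the Job with setPartitionerClass, setSortComparatorClass, and setGroupingComparatorClass.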

The result of these steps is that all records with the same key will be sent to the same Reducer and will arrive in the same set of values to be reduced. The record tagged as PK comes first, followed by all the records from B that need to be joined. Now, in the Reducer:

 value_of_PK = values[0]   // First value is the value of your primary key
 for value in values[1:]:
     value.replace(FK, value_of_PK)   // Replace the foreign key with the key value
     emit <key, value>
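
A rough Java equivalent of that reducer pseudo-code, again assuming the <K>#<tag> key, the PK-first sort order, and grouping on the natural key from the sketches above; the actual foreign-key substitution is reduced to a simple concatenation here:

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text compositeKey, Iterable<Text> values, Context ctx)
            throws IOException, InterruptedException {
        String naturalKey = compositeKey.toString().split("#", 2)[0];
        String valueOfPK = null;
        for (Text value : values) {
            if (valueOfPK == null) {
                // First value is the PK record from file A (the sort order guarantees this).
                valueOfPK = value.toString();
            } else {
                // Remaining values are records from file B: emit them with the
                // foreign key replaced by (here, simply joined to) the PK value.
                ctx.write(new Text(naturalKey), new Text(valueOfPK + "\t" + value));
            }
        }
    }
}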

The result will be file B with every occurrence of K replaced by the value of K from file A. You can also extend this to do a full inner join, or write both files out in full for loading directly into a database, but those are pretty trivial modifications once you have this working.

+2

Source: https://habr.com/ru/post/1433908/

