I've started working with Hadoop, and I'm building a MapReduce job for "customers who bought x also bought y," where y is the product most often bought together with x. I'm looking for advice on making this job more efficient, specifically on reducing the amount of data shuffled from the map nodes to the reducer nodes. My goal is slightly different from other bought-x-also-bought-y scenarios in that I only want to store the single most frequently co-purchased product for each product, not the full list of co-purchased products ranked by frequency.
I am following this blog post to guide my approach.
As I understand it, one of the big performance limiters in Hadoop is shuffling data from the map nodes to the reducer nodes, so for each phase of the MapReduce chain I want to keep the amount of shuffled data to a minimum.
Say my initial dataset is the SQL table purchases_products, the join table linking each purchase to the products bought in that purchase. I will feed the output of select x.product_id, y.product_id from purchases_products x inner join purchases_products y on x.purchase_id = y.purchase_id and x.product_id != y.product_id into my MapReduce job. (For example, a purchase containing products 1, 2, and 3 yields the pairs (1,2), (1,3), (2,1), (2,3), (3,1), (3,2); in general a purchase of n products yields n*(n-1) rows.)
My MapReduce strategy is to map each (product_id_x, product_id_y) pair to the key product_id_x_product_id_y with the value 1, and then sum the values in my reduce step. Then I can split the keys apart and save the pairs back to the SQL table.
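For reference, here is a minimal sketch of what I mean, assuming the input is one tab-separated "product_id_x&lt;TAB&gt;product_id_y" line per record as produced by the SQL export above (class names like PairCountMapper are just placeholders for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairCount {

    public static class PairCountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text pairKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // "12<TAB>34" becomes key "12_34" with value 1
            String[] ids = line.toString().split("\t");
            pairKey.set(ids[0] + "_" + ids[1]);
            context.write(pairKey, ONE);
        }
    }

    public static class PairCountReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text pairKey, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s for each x_y pair to get its co-purchase count
            long sum = 0;
            for (LongWritable count : counts) {
                sum += count.get();
            }
            context.write(pairKey, new LongWritable(sum));
        }
    }
}
```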
My problem with this job is that it shuffles a potentially huge number of rows, even though the result set I ultimately want is only count(products) rows. Ideally I would like a combiner step to cut down the number of rows shuffled to the reducers, but I see no way to do this reliably.
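The closest thing I have found is the standard trick of registering the reducer as a combiner, which is valid here because summing counts is associative and commutative. A hedged sketch of the driver wiring is below (reusing the hypothetical classes from the previous snippet); as far as I can tell, though, it only shrinks the shuffle to the extent that the same x_y pair recurs within a single map task's input split, and it can never get below one row per distinct pair.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PairCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pair-count");
        job.setJarByClass(PairCountJob.class);

        job.setMapperClass(PairCount.PairCountMapper.class);
        // Local pre-aggregation before the shuffle; safe because the
        // reduce function is a plain sum.
        job.setCombinerClass(PairCount.PairCountReducer.class);
        job.setReducerClass(PairCount.PairCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```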
Is this just an inherent limitation of the task, or are there Hadoop tricks for structuring the workflow that would help me cut down the data shuffled in this step? Should I even be worrying about the amount of shuffle in this case, or not?
Thanks!