I've started working with Hadoop, and I'm building a MapReduce job for "customers who bought x also bought y," where y is the product most often bought together with x. I'm looking for advice on making this job more efficient, specifically on reducing the amount of data shuffled from the map nodes to the reducer nodes. My goal is slightly different from other bought-x-also-bought-y scenarios in that I only want to store the single most frequently co-purchased product for each product, not the full list of co-purchased products ranked by frequency.
I am following this blog post to guide my approach.
As I understand it, one of the big performance limiters in Hadoop is shuffling data from the map nodes to the reducer nodes, so for each phase of the MapReduce chain I want to keep the amount of shuffled data to a minimum.
Say my initial dataset is the SQL table purchases_products, the join table linking each purchase to the products bought in that purchase. I will feed the output of select x.product_id, y.product_id from purchases_products x inner join purchases_products y on x.purchase_id = y.purchase_id and x.product_id != y.product_id into my MapReduce job. (For example, a purchase containing products 1, 2, and 3 yields the pairs (1,2), (1,3), (2,1), (2,3), (3,1), (3,2); in general a purchase of n products yields n*(n-1) rows.)
My MapReduce strategy is to map each (product_id_x, product_id_y) pair to the key product_id_x_product_id_y with the value 1, and then sum the values in my reduce step. Then I can split the keys apart and save the pairs back to the SQL table.
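For reference, here is a minimal sketch of what I mean, assuming the input is one tab-separated "product_id_x&lt;TAB&gt;product_id_y" line per record as produced by the SQL export above (class names like PairCountMapper are just placeholders for illustration):

```java
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class PairCount {

    public static class PairCountMapper
            extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text pairKey = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // "12<TAB>34" becomes key "12_34" with value 1
            String[] ids = line.toString().split("\t");
            pairKey.set(ids[0] + "_" + ids[1]);
            context.write(pairKey, ONE);
        }
    }

    public static class PairCountReducer
            extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text pairKey, Iterable<LongWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Sum the 1s for each x_y pair to get its co-purchase count
            long sum = 0;
            for (LongWritable count : counts) {
                sum += count.get();
            }
            context.write(pairKey, new LongWritable(sum));
        }
    }
}
```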
My problem with this job is that it shuffles a potentially huge number of rows, even though the result set I ultimately want is only count(products) rows. Ideally I would like a combiner step to cut down the number of rows shuffled to the reducers, but I see no way to do this reliably.
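The closest thing I have found is the standard trick of registering the reducer as a combiner, which is valid here because summing counts is associative and commutative. A hedged sketch of the driver wiring is below (reusing the hypothetical classes from the previous snippet); as far as I can tell, though, it only shrinks the shuffle to the extent that the same x_y pair recurs within a single map task's input split, and it can never get below one row per distinct pair.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PairCountJob {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "pair-count");
        job.setJarByClass(PairCountJob.class);

        job.setMapperClass(PairCount.PairCountMapper.class);
        // Local pre-aggregation before the shuffle; safe because the
        // reduce function is a plain sum.
        job.setCombinerClass(PairCount.PairCountReducer.class);
        job.setReducerClass(PairCount.PairCountReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```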
Is this just an inherent limitation of the task, or are there Hadoop tricks for structuring the workflow that would help me cut down the data shuffled in this step? Should I even be worrying about the amount of shuffle in this case, or not?
Thanks!