At the very top of your Hive log, it says: "Warning: Shuffle Join JOIN[4][tables = [a, b]] in Stage 'Stage-1:MAPRED' is a cross product".
EDIT: A "cross product", or Cartesian product, is a join without a condition: it returns every row of table "b" for every row of table "a". So if "a" has 5 rows and "b" has 10 rows, you get the product, 5 multiplied by 10 = 50 rows. Most of those rows pair records that have nothing to do with each other, and the WHERE clause then has to throw them away.
Now, if you have a table "a" of 20,000 rows and join it to another table "b" of 500,000 rows, you are asking the SQL engine to build a data set "a, b" of 10,000,000,000 rows and then evaluate the BETWEEN on all 10 billion of them.
So if you cut the number of "b" rows, you will see more benefit than from cutting "a". In your example, if you can filter the ip_logs table (table 2), which I am assuming has more rows than the table of order numbers, that will cut the execution time. End edit
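To put numbers on this for your own tables, a quick count of each side gives the size of the cross product. This is only a sketch: foo and bar are the placeholder table names used in the query further down, not your real tables.

```sql
-- foo and bar are placeholder names; substitute your own tables.
-- COUNT(*) is a BIGINT in Hive, so 20,000 * 500,000 does not overflow.
SELECT a_cnt.n AS a_rows,
       b_cnt.n AS b_rows,
       a_cnt.n * b_cnt.n AS pairs_compared
FROM (SELECT COUNT(*) AS n FROM foo) a_cnt
CROSS JOIN (SELECT COUNT(*) AS n FROM bar) b_cnt;
```

Hive may print a cross-product warning for this query too, but each side here is a single row, so it is harmless.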
By not specifying a join condition, you force the execution engine to work through the full Cartesian product. It has to scan the whole table again and again. With 10 rows you will not have a problem. With 20k you run into dozens of map/reduce waves.
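One way to check for this before running anything is to ask Hive for the plan. A minimal sketch, assuming your query has roughly this shape (foo, bar and the column names are placeholders, so the real ones will differ):

```sql
-- No join condition: every row of foo is paired with every row of bar,
-- and only afterwards does the WHERE clause filter the pairs.
EXPLAIN
SELECT b.itemcode
FROM foo a
CROSS JOIN bar b
WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
```

If the cross-product warning shows up when the statement is compiled, the join has no usable key.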
Try this query:
SELECT b.itemcode FROM foo a JOIN bar b on <SomeKey> WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;
But it's hard for me to tell which column your data model would let you join on. Maybe the data model for this query can be improved? Or maybe I'm just misreading the sample.
In any case, you need to cut down the number of comparisons before the WHERE clause is applied. Another way I have done this in Hive is to create a view over a smaller dataset and join/match against the view instead of the original table.
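As a rough sketch of the view approach, suppose you only care about a window of order numbers. The bounds 100000 and 200000, the view name bar_small, and the table and column names are all assumptions for illustration.

```sql
-- Hypothetical window of order numbers: 100000 to 200000.
-- Keep only the ranges in bar that can overlap that window.
CREATE VIEW IF NOT EXISTS bar_small AS
SELECT itemcode, startorderno, endorderno
FROM bar
WHERE endorderno >= 100000
  AND startorderno <= 200000;

-- Filter foo to the same window, then match against the smaller view.
SELECT b.itemcode
FROM foo a
CROSS JOIN bar_small b
WHERE a.orderno BETWEEN 100000 AND 200000
  AND a.orderno BETWEEN b.startorderno AND b.endorderno;
```

This is still a cross product, just over far fewer rows on both sides, so it helps most when the filters are selective.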