Performance issue in hive version 0.13.1

Question

I use AWS EMR to run my queries on Hive, and I have a performance issue when running Hive version 0.13.1.

With the newer version of Hive, the query took about 5 minutes to run on 10 rows of data. But the same script on 230804 rows has been running for 2 days and still has not finished. What should I do to analyze and fix the problem?

Sample data:

Table 1:

    hive> describe foo;
    OK
    orderno    string
    Time taken: 0.101 seconds, Fetched: 1 row(s)

Example data for table 1:

    hive> select * from foo;
    OK
    1826203307
    1826207803
    1826179498
    1826179657

Table 2:

    hive> describe de_geo_ip_logs;
    OK
    id              bigint
    startorderno    bigint
    endorderno      bigint
    itemcode        int
    Time taken: 0.047 seconds, Fetched: 4 row(s)

Sample data for table 2:

    hive> select * from bar;
    127698025    417880320     417880575     306
    127698025    3038626048    3038626303    584
    127698025    3038626304    3038626431    269
    127698025    3038626560    3038626815    163

My query:

 SELECT b.itemcode FROM foo a, bar b WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;

I have attached the Hive logs for the above query.

amazon-web-services hadoop emr hive ami

brisk Jan 12 '15 at 11:17

1 answer

suiterdev · Accepted Answer · 2015-01-20T16:13:11+0000

At the very top of your Hive log, it says: "Warning: Shuffle Join JOIN[4][tables = [a, b]] in Stage 'Stage-1:MAPRED' is a cross product"

EDIT: A "cross product", or Cartesian product, is an unconditional join that returns every row of table "b" for every row of table "a". So if, for example, "a" has 5 rows and "b" has 10 rows, you get the product, 5 multiplied by 10 = 50 rows. Most of those rows pair up records that have nothing to do with each other.

Now, if you have a table "a" of 20,000 rows and join it to another table "b" of 500,000 rows, you are asking the SQL engine to return a data set "a, b" of 10,000,000,000 rows, and then to perform the BETWEEN operation on those 10 billion rows.

So if you cut down the number of rows in "b", you will gain more than by cutting down "a". In your example, filtering the ip_logs table (table 2) will reduce the run time, since I am assuming it has more rows than the table of order numbers. End edit
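To put a number on this for your own tables, a quick sketch (the table names foo and bar are taken from the question; nothing else is assumed) is to count each input and multiply the two results. That product is the number of pairs the cross join has to build before BETWEEN can discard anything:

    -- Row counts of the two join inputs.
    SELECT COUNT(*) FROM foo;
    SELECT COUNT(*) FROM bar;

With the 230,804 rows you mention for foo, every additional 1,000 rows in bar adds 230,804 × 1,000 = 230,804,000 comparisons.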

You are forcing the execution engine to work through a Cartesian product by not specifying a join condition. It has to scan the whole table over and over. With 10 rows you will not have a problem. With 20k you run into dozens of map/reduce waves.
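Since you asked how to analyze this: a minimal way to inspect the plan without launching the job is to prefix the query with EXPLAIN. This is only a sketch using the query from the question; I would expect the same cross-product warning to show up when the statement is compiled, but I have not checked that against your exact setup.

    -- Prints the stage plan for the query instead of executing it.
    EXPLAIN
    SELECT b.itemcode
    FROM foo a, bar b
    WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;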

Try this query:

  SELECT b.itemcode FROM foo a JOIN bar b on <SomeKey> WHERE a.orderno BETWEEN b.startorderno AND b.endorderno;

But I am having a hard time seeing which column in your model you could join on. Maybe the data model for this query can be improved? Or maybe I am just misreading the sample.

In any case, you need to cut down the number of comparisons before the WHERE clause. Another way I have done this in Hive is to create a view over a smaller dataset and join/match against the view instead of the original table.
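A rough sketch of that view approach, under some assumptions: bar_small is a made-up view name, and the two bounds are simply the smallest and largest orderno among the four sample rows in the question (in practice you would take them from SELECT MIN(orderno), MAX(orderno) FROM foo). The explicit CAST is also my assumption, added because orderno is a string while the range columns are bigint.

    -- Hypothetical view: keep only the ranges that can overlap the order numbers in foo.
    CREATE VIEW bar_small AS
    SELECT startorderno, endorderno, itemcode
    FROM bar
    WHERE endorderno >= 1826179498
      AND startorderno <= 1826207803;

    -- Same range match as before, but against the reduced set of ranges.
    SELECT b.itemcode
    FROM foo a CROSS JOIN bar_small b
    WHERE CAST(a.orderno AS BIGINT) BETWEEN b.startorderno AND b.endorderno;

This is still a cross product, so it only helps to the extent that the filter removes a large share of bar.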
