Hive, Hadoop, and the mechanics of hive.exec.reducers.max

In the context of this other question here

Using the hive.exec.reducers.max setting really puzzled me.

From my point of view, I thought Hive worked on some kind of logic like: I have N blocks in the input query, so I need N map tasks. From N I will need some sensible range of reducers R, which can be anywhere from R = N/2 down to R = 1. For the Hive report I was working on, there were 1200+ maps, and with no influence from me Hive planned about 400 reducers, which would have been fine except that I was working on a cluster with only 70 reduce slots. Even with the fair job scheduler, this caused a backlog that could hang other jobs. So I tried a lot of different experiments until I found hive.exec.reducers.max and set it to about 60.
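For what it's worth, Hive's planner (in the classic MapReduce execution path) does not derive R from N at all; it estimates the reducer count from the total input size. A rough sketch of the relevant settings, with the old defaults as I understand them (your release may differ):

    -- Hive on MapReduce estimates the reducer count roughly as:
    --   reducers = min(hive.exec.reducers.max,
    --                  ceil(total_input_bytes / hive.exec.reducers.bytes.per.reducer))
    SET hive.exec.reducers.bytes.per.reducer=1000000000;  -- ~1 GB per reducer (old default)
    SET hive.exec.reducers.max=999;                       -- old default cap
    -- Note: neither setting consults the cluster's actual reduce-slot capacity.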

The result was that a Hive job that used to take 248 minutes finished in 155 minutes with no change in the output. What bothers me is: why not have Hive default to never letting N exceed the cluster's reducer capacity? And seeing that I can churn through several terabytes of data with a smaller set of reducers than what Hive thinks is correct, is it always better to try to tune this count?
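For concreteness, a minimal sketch of the workaround I ended up with; the table and query here are made up for illustration:

    -- Cap the planner just below the cluster's 70 reduce slots:
    SET hive.exec.reducers.max=60;
    -- Leave mapred.reduce.tasks at -1 so Hive still estimates the count
    -- itself, now subject to the cap above:
    SET mapred.reduce.tasks=-1;
    SELECT dt, COUNT(*) FROM report_events GROUP BY dt;  -- hypothetical query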

+3
2 answers

You might want to have a look at this page, which talks about optimizing the number of task slots: http://wiki.apache.org/hadoop/LimitingTaskSlotUsage
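For reference, the slot limits that page talks about are cluster-side TaskTracker settings in mapred-site.xml, not Hive session settings; a sketch with purely illustrative values:

    <!-- mapred-site.xml on each TaskTracker node (MR1); values are illustrative -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>4</value>   <!-- concurrent map slots per node -->
    </property>
    <property>
      <name>mapred.tasktracker.reduce.tasks.maximum</name>
      <value>2</value>   <!-- concurrent reduce slots per node, e.g. 35 nodes x 2 = 70 -->
    </property>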

Here is my opinion on the same:

1) Hive estimates the number of reducers from the expected size of the input data; it does not take the number of reduce slots actually available on the cluster into account.

2) My guesses as to why the job that took 248 minutes finished in 155:

Case 1: Hive plans 400 reducers, but only 70 reduce slots exist.

  • Only 70 reduce tasks can run at once; the rest wait in the queue and get scheduled in waves as slots free up, which adds scheduling delay.

  • JVM startup. Every newly launched reduce task spins up its own JVM, and starting a JVM is expensive.

  • Also, with 400 reducers each one processes a smaller slice of the data, but the shuffle and task-startup overhead is paid 400 times while concurrency stays capped at 70, so most of the 400 tasks spend their time waiting in the queue.

Case 2: Hive plans 70 reducers - every reducer gets a slot right away and the job runs in a single wave, with no queuing and far less per-task overhead.

Whether tuning the count down is always better is hard to answer in general; it depends on the data and the cluster. Benchmark both settings on your own workload.
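To put a rough number on Case 1 (my own back-of-envelope estimate, not part of the original answer): 400 reduce tasks on 70 slots take ceil(400 / 70) = 6 scheduling waves, each paying task-setup and JVM-startup cost, versus a single wave when only 70 reducers are planned. Even a minute or two of overhead per wave adds up over a long job.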

+2

mapred.job.reuse.jvm.num.tasks (set it to something like 8) is another knob worth tuning. When a job consists of many small tasks in the 20-30 second range, reusing the JVM avoids paying the startup cost for each one; JVM reuse helps most for short-lived tasks (< 30 seconds).
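A minimal sketch of applying that from a Hive session (the value 8 follows the suggestion above; the MR1 default is 1, and -1 reuses a JVM for an unlimited number of tasks):

    -- Reuse each JVM for up to 8 tasks of the same job instead of
    -- forking a fresh JVM for every short task:
    SET mapred.job.reuse.jvm.num.tasks=8;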

+2

Source: https://habr.com/ru/post/1792726/

