In the context of this other question here
The hive.exec.reducers.max directive really puzzled me.
From my point of view, I assumed Hive worked with logic along these lines: I have N blocks in the input, so I need N mappers, and then I'll need some reasonable number of reducers R, anywhere from R = N/2 down to R = 1. For the Hive report I was working on, there were 1200+ mappers, and left to itself Hive planned about 400 reducers, which would have been great except that I was running on a cluster with only 70 reduce slots. Even with the Fair Scheduler, this caused a backlog that could hang other jobs. So I tried a bunch of different experiments until I found hive.exec.reducers.max and set it to about 60.
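For intuition, Hive's planner estimates the reducer count from the input size rather than from cluster capacity, roughly min(hive.exec.reducers.max, ceil(input bytes / hive.exec.reducers.bytes.per.reducer)). Here is a minimal sketch of that estimate; the function name and the example input size are illustrative, not taken from Hive's source, and the 1 GB per-reducer figure is the old default of hive.exec.reducers.bytes.per.reducer:

```python
import math

def estimate_reducers(input_bytes, bytes_per_reducer=1_000_000_000, reducers_max=999):
    """Rough sketch of Hive's reducer estimate:
    min(hive.exec.reducers.max,
        ceil(input / hive.exec.reducers.bytes.per.reducer))."""
    return min(reducers_max, max(1, math.ceil(input_bytes / bytes_per_reducer)))

# ~400 GB of input with a 1 GB-per-reducer default -> ~400 reducers,
# regardless of how many reduce slots the cluster actually has
print(estimate_reducers(400 * 10**9))                   # 400
# capping hive.exec.reducers.max at 60 keeps the job within 70 slots
print(estimate_reducers(400 * 10**9, reducers_max=60))  # 60
```

This is why the plan came out at ~400 reducers: the formula never consults the cluster's slot count, only the data volume and the configured cap.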
The result was that a Hive job that had taken 248 minutes finished in 155 minutes with no change in the output. What bothers me is: why doesn't Hive default N so that it never exceeds the cluster's reducer-slot capacity? And, given that I can crunch several terabytes of data faster with a smaller reducer count than the one Hive thinks is right, is it always worth trying to tune this count?
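For reference, the cap can be set per session before the query runs; 60 here is just the value from my experiment, chosen to stay under the cluster's 70 reduce slots:

```
-- Cap the number of reducers below the cluster's reduce-slot capacity
SET hive.exec.reducers.max=60;
```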