I see that whenever I run a MapReduce job, Hadoop shows me the percentage of completed map and reduce operations.
I understand that both the mappers and the reducers run in a distributed fashion and can report how much they have processed back to the master (the JobTracker).
But how does the master know the total amount of data to be processed? If it had to determine the size of every input file up front, I would expect that to be inefficient. Is this some kind of rough approximation?
