When you submit a MapReduce job using the hadoop jar
command, the JobTracker determines how many map tasks are needed for your job. This is usually determined by the number of input splits (typically one per HDFS block of the input file), and that number is fixed regardless of how many worker nodes you have. The JobTracker then enlists one or more TaskTrackers to execute those tasks.
The application JAR (along with any additional JARs specified using the -libjars
argument) is automatically copied to all machines running the TaskTrackers that execute your tasks. All of this is handled by the Hadoop infrastructure.
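As a concrete illustration, a job submission with extra dependency JARs might look like the following. The JAR name, class name, and paths here are hypothetical, not taken from the question:

```shell
# Submit the job; Hadoop ships myjob.jar and the -libjars
# dependencies to every TaskTracker node that runs a task.
hadoop jar myjob.jar com.example.MyJob \
    -libjars /path/to/dep1.jar,/path/to/dep2.jar \
    /input/path /output/path
```

Note that -libjars is a comma-separated list, and it must come before the job's own arguments (it is parsed by GenericOptionsParser, which requires your main class to use ToolRunner).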
Adding more TaskTrackers will increase the parallelism of your job, assuming there are still unscheduled map tasks. What it will not do is automatically repartition the input to spread the work over the additional map capacity. So if you have a map capacity of 24 (say, 6 map slots on each of 4 data nodes) and 100 map tasks, with the first 24 running at once, then adding more data nodes will give you some extra speed. If you have only 12 map tasks, adding machines won't help you.
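To make the arithmetic concrete, here is a small sketch of that capacity calculation. The node and slot counts are just the example numbers from above, not values Hadoop reports:

```shell
# Hypothetical cluster: 4 data nodes, 6 map slots each.
NODES=4
SLOTS_PER_NODE=6
MAP_TASKS=100

# Total map slots available at once.
CAPACITY=$((NODES * SLOTS_PER_NODE))

# "Waves" of map tasks needed to finish the job (ceiling division):
# 100 tasks / 24 slots -> 5 waves, the last one only partly full.
WAVES=$(( (MAP_TASKS + CAPACITY - 1) / CAPACITY ))

echo "capacity=$CAPACITY waves=$WAVES"
```

Adding nodes raises CAPACITY and so reduces the number of waves, but once CAPACITY exceeds MAP_TASKS (e.g. only 12 map tasks on 24 slots), extra nodes buy you nothing.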
Finally, be aware of data locality. Since data should ideally be processed on the same machines that store it, adding new TaskTrackers will not necessarily give a proportional speedup, because initially the data will not be local to those nodes and will have to be copied over the network.