Can Hadoop distribute tasks and the code base?

I'm starting to play with Hadoop (I don't have access to a cluster yet, so I'm just experimenting offline). My question is: once a cluster is set up, how are tasks distributed, and can the code base be moved to new nodes?

Ideally, I'd like to run large batch jobs and, when I need more capacity, add new nodes to the cluster. But I'm not sure whether I'll have to copy the same code that runs locally onto each node, or whether I can do something special so that capacity can be added while a batch job is running. I thought I could store my code base on HDFS and pull it down to run locally whenever I need it, but that still means I need some kind of bootstrap script on the server, and I'd have to run it manually first.

Any suggestions or tips would be greatly appreciated!

Thanks.

+6

4 answers

When you submit a MapReduce job with the hadoop jar command, the jobtracker determines how many mappers are needed to complete the job. This is usually decided by the number of blocks in the input file, and that number is fixed regardless of how many worker nodes you have. The jobtracker then enlists one or more tasktrackers to carry out the work.

Your application jar (along with any other jars specified via the -libjars argument) is automatically copied to all of the machines running the tasktrackers that execute your mappers and reducers. All of this is handled by the Hadoop infrastructure.
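As a rough sketch (the jar, class, and path names below are invented, and the driver is assumed to use ToolRunner so that generic options such as -libjars get parsed), a submission with extra dependencies might look like this:

    # hypothetical job submission; Hadoop ships my-job.jar and the
    # jars listed after -libjars to the tasktrackers running the tasks
    hadoop jar my-job.jar com.example.MyJob \
        -libjars /local/libs/extra-one.jar,/local/libs/extra-two.jar \
        /input/path /output/path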

Adding more tasktrackers will increase the parallelism of your job, assuming there are still unscheduled map tasks. What it will not do is automatically repartition the input to spread it over the additional map capacity. So if you have a map capacity of 24 (6 map slots on each of 4 data nodes) and 100 map tasks, with the first 24 executing, then adding another data node will give you some extra speed. If you have only 12 map tasks, adding machines won't help you.
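For reference, the per-node slot count in that arithmetic comes from the tasktracker configuration; a minimal sketch of the relevant property in mapred-site.xml (the value 6 simply matches the example above) might be:

    <!-- mapred-site.xml on each tasktracker: 6 map slots x 4 nodes = 24 -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>6</value>
    </property>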

Finally, you need to be aware of data locality. Since the data should ideally be processed on the same machines that store it, adding new tasktrackers will not necessarily give a proportional speedup, because the data will not initially be local to those nodes and will have to be copied over the network.

+7

I don't quite agree with Daniel. If it were true that "when a job starts, the jar is copied to all of the nodes the cluster knows about", then even if you use only 100 mappers and there are 1000 nodes, the code for every job would always be copied to all 1000 nodes. That makes no sense.

Instead, in line with Chris Shain's answer, it makes more sense that whenever the JobScheduler on the jobtracker selects a task to execute and assigns it to a particular datanode, it tells that node's tasktracker at that point where to copy the code base from.

Initially (before the MapReduce job starts), the code base is copied to several locations, as defined by the mapred.submit.replication parameter. A tasktracker can therefore copy the code base from any of several locations, a list of which can be sent to it by the jobtracker.
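If you want to experiment with that parameter, it can be overridden per job. A hedged sketch (the job name and paths are invented, and the driver is again assumed to parse generic options via ToolRunner):

    # raise the replication of the submitted job files (the default is 10)
    hadoop jar my-job.jar com.example.MyJob \
        -D mapred.submit.replication=20 \
        /input/path /output/path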

+1

Before trying to build a Hadoop cluster, I would suggest playing with Hadoop using Amazon Elastic MapReduce.

Regarding the problem you are trying to solve, I'm not sure Hadoop is a good fit. Hadoop is useful for trivially parallelizable batch jobs: analyzing thousands (or more) of documents, sorting, regrouping data. Hadoop Streaming will let you write mappers and reducers in any language you like, but the inputs and outputs must be in a fixed format. There are many uses, but in my opinion, process control was not one of the design goals.
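As a rough illustration of Hadoop Streaming (the paths are invented, and the location of the streaming jar varies between Hadoop versions), a job that uses standard Unix tools as the mapper and reducer might look like:

    hadoop jar $HADOOP_HOME/contrib/streaming/hadoop-streaming-*.jar \
        -input /data/documents \
        -output /data/line-counts \
        -mapper /bin/cat \
        -reducer /usr/bin/wc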

[EDIT] Perhaps ZooKeeper is closer to what you are looking for.

0

You can add capacity to a batch job if you want, but it has to be anticipated as a possibility in your code base. For example, if you have a mapper that holds a set of inputs you want to assign to multiple nodes, you can arrange for that. All of this can be done, but not with a default Hadoop installation.

I am currently working on a Nested Map-Reduce framework that extends the Hadoop code base and allows you to spawn more nodes based on the inputs that the mapper or reducer receives. If you're interested, drop me a line and I will explain more.

Also, when it comes to the -libjars option, it only works for the nodes assigned by the jobtracker for the job you wrote. So if you specify 10 mappers, -libjars will copy your code there. If you want to start with 10 but work your way up, the nodes you add will not have the code.

The easiest way to get around this is to add your jar to the classpath in the hadoop-env.sh script. That way, when a job starts, the jar is copied to all of the nodes the cluster knows about.
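A minimal sketch of that approach, assuming the jar has been placed at an illustrative path on each node:

    # in conf/hadoop-env.sh on every node (the path is illustrative)
    export HADOOP_CLASSPATH=/opt/jobs/my-job.jar:$HADOOP_CLASSPATH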

0

Source: https://habr.com/ru/post/908719/

