Hadoop cache file for all map tasks

My map function needs to read the same file for every input. The file never changes; it is read-only. I think the distributed cache can help me, but I can't find a way to use it. The public void configure(JobConf conf) method that I would need to override is, I believe, deprecated, since JobConf itself is no longer recommended, and every DistributedCache tutorial uses that obsolete method. What can I do? Is there some other setup method that I can override instead?

These are the very first lines of my map function:

Configuration conf = new Configuration();
// load the MFile
FileSystem fs = FileSystem.get(conf);
Path inFile = new Path("planet/MFile");
FSDataInputStream in = fs.open(inFile);
DecisionTree dtree = new DecisionTree().loadTree(in);

I want to cache this MFile so that my map function does not have to read it over and over again.

2 answers

JobConf was deprecated in 0.20.x, but in 1.0.0 it is not deprecated any more! :-) (as of this writing)

Coming to your question: there are two ways to write map reduce jobs in Java, one by extending the classes in the org.apache.hadoop.mapreduce package, and the other by implementing the interfaces in the org.apache.hadoop.mapred package (or the other way round).

Not sure which one you are using; if you don't have a configure method to override, you will have a setup method to override instead:

@Override
protected void setup(Context context) throws IOException, InterruptedException

This is similar to configure and should help you.

You get the setup method to override when you extend the Mapper class from org.apache.hadoop.mapreduce.
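
For what it's worth, here is a minimal sketch of what that can look like with the new API. The class name CachedTreeMapper, the key/value types and the dtree field are placeholders chosen for illustration, and DecisionTree is the asker's own class, so treat this as an outline rather than working code:

// Sketch only: a Mapper from the org.apache.hadoop.mapreduce package that
// loads a read-only file once per task in setup(), instead of once per record.
import java.io.IOException;
import java.net.URI;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.filecache.DistributedCache;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CachedTreeMapper extends Mapper<LongWritable, Text, Text, Text> {

    private DecisionTree dtree; // DecisionTree is the asker's own class

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        // Assumes the driver registered the file with DistributedCache.addCacheFile(...).
        URI[] cached = DistributedCache.getCacheFiles(conf);
        FileSystem fs = FileSystem.get(conf);
        FSDataInputStream in = fs.open(new Path(cached[0].toString()));
        dtree = new DecisionTree().loadTree(in);
        in.close();
    }

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        // Use the already loaded dtree here; no need to re-open MFile per record.
    }
}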


OK, I think I did it. I followed Ravi Bhatt's advice and wrote this:

@Override
protected void setup(Context context) throws IOException, InterruptedException {
    FileSystem fs = FileSystem.get(context.getConfiguration());
    URI[] files = DistributedCache.getCacheFiles(context.getConfiguration());
    Path path = new Path(files[0].toString());
    in = fs.open(path);
    dtree = new DecisionTree().loadTree(in);
}

Inside my main method, I do this to add it to the cache:

DistributedCache.addCacheFile(new URI(args[0] + "/" + "MFile"), conf);
Job job = new Job(conf, "MR phase one");
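
One detail I'd double-check (from memory of the 1.x API, so take it as an assumption): new Job(conf) takes its own copy of the Configuration, so addCacheFile() has to be called before the Job is created, as above, or has to be pointed at the job's own configuration instead:

// Alternative ordering (sketch): register the cache file on the Job's own Configuration.
Job job = new Job(conf, "MR phase one");
DistributedCache.addCacheFile(new URI(args[0] + "/" + "MFile"), job.getConfiguration());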

I am able to retrieve the file I need, but I can't tell whether it works 100%. Is there any way to check this? Thanks.
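
For anyone landing here with the same doubt, one cheap sanity check (just a sketch, still assuming the old org.apache.hadoop.filecache.DistributedCache API and a real cluster rather than the local job runner, where localization can behave differently): in setup(), ask for the files that were actually localized on the task node and write them to the task log:

// Sketch: verify inside setup() that the cache file was localized for this task.
Path[] localFiles = DistributedCache.getLocalCacheFiles(context.getConfiguration());
if (localFiles == null || localFiles.length == 0) {
    throw new IOException("MFile was not localized - check addCacheFile() in the driver");
}
for (Path p : localFiles) {
    // This ends up in the task's stderr log, visible from the JobTracker web UI.
    System.err.println("Localized cache file: " + p);
}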

