Is there any way to pass arguments to a Mapper's constructor in Hadoop? Perhaps through some library that wraps the job setup?
Here is my scenario:
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class HadoopTest {

        // Extractor turns a line into a "feature"
        public static interface Extractor {
            public String extract(String s);
        }

        // A concrete Extractor, configurable with a constructor parameter
        public static class PrefixExtractor implements Extractor {
            private int endIndex;

            public PrefixExtractor(int endIndex) {
                this.endIndex = endIndex;
            }

            public String extract(String s) {
                return s.substring(0, this.endIndex);
            }
        }

        public static class Map extends Mapper<Object, Text, Text, Text> {
            private Extractor extractor;

            // Constructor configures the extractor
            public Map(Extractor extractor) {
                this.extractor = extractor;
            }

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String feature = extractor.extract(value.toString());
                context.write(new Text(feature), new Text(value.toString()));
            }
        }

        public static class Reduce extends Reducer<Text, Text, Text, Text> {
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text val : values)
                    context.write(key, val);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "test");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }
As should be clear, since the Mapper is handed to the Job only as a class reference (Map.class), Hadoop has no way to pass a constructor argument, and so no way to configure a specific Extractor.
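The reason is that Hadoop instantiates the Mapper reflectively from the class reference (via ReflectionUtils.newInstance), which requires a no-argument constructor. A simplified, self-contained sketch of that mechanism (my own names, not Hadoop's actual code):

```java
import java.lang.reflect.Constructor;

// Simplified sketch of how Hadoop creates the Mapper from a class reference:
// it looks up the no-argument constructor, so a class that only has a
// parameterized constructor (like Map(Extractor) above) cannot be created.
public class ReflectionSketch {

    // Stand-in for the Map(Extractor) mapper: no no-arg constructor.
    static class NeedsArg {
        NeedsArg(int n) { }
    }

    static <T> T newInstance(Class<T> cls) throws Exception {
        Constructor<T> ctor = cls.getDeclaredConstructor(); // needs a no-arg ctor
        ctor.setAccessible(true);
        return ctor.newInstance();
    }

    public static void main(String[] args) throws Exception {
        try {
            newInstance(NeedsArg.class);
        } catch (NoSuchMethodException e) {
            // This is essentially the failure Hadoop would hit with Map(Extractor)
            System.out.println("no no-arg constructor available");
        }
    }
}
```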
There are Hadoop-wrapping frameworks like Scoobi, Crunch, and Scrunch (and probably many others that I don't know about) that seem to have this ability, but I don't know how they do it. EDIT: After working with Scoobi some more, I found that I was partially mistaken. If you use an external object inside a "mapper", Scoobi requires it to be Serializable, and will complain at runtime if it is not. So maybe the right approach is to serialize the Extractor, ship the serialized form with the job, and deserialize it in the Mapper's setup method...
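That serialize-and-deserialize idea can be sketched in plain Java: serialize the configured Extractor to a Base64 string, which could then be stored under some Configuration key in main (e.g. conf.set("extractor", encoded)) and read back in setup(). The helper names and key here are my own, just for illustration:

```java
import java.io.*;
import java.util.Base64;

public class SerializedExtractor {

    // Extractor must be Serializable for this approach to work at all
    public interface Extractor extends Serializable {
        String extract(String s);
    }

    public static class PrefixExtractor implements Extractor {
        private final int endIndex;
        public PrefixExtractor(int endIndex) { this.endIndex = endIndex; }
        public String extract(String s) { return s.substring(0, endIndex); }
    }

    // Would run in main: turn the configured Extractor into a String that
    // can be stored in the job Configuration under an arbitrary key.
    public static String encode(Extractor e) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Would run in Mapper.setup(): read the string back from the
    // Configuration and reconstruct the configured Extractor instance.
    public static Extractor decode(String encoded)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Extractor) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String encoded = encode(new PrefixExtractor(3));
        Extractor restored = decode(encoded);
        System.out.println(restored.extract("hadoop"));
    }
}
```

The round trip preserves the constructor argument (endIndex), which is exactly what a class-reference-only Mapper loses.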
Also, I actually work in Scala, so Scala-based solutions are certainly welcome (if not encouraged!)