Is there any way to pass arguments to a Mapper's constructor in Hadoop? Perhaps through some library that wraps the job setup?
Here is my scenario:
    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

    public class HadoopTest {

        // Extractor turns a line into a "feature"
        public static interface Extractor {
            public String extract(String s);
        }

        // A concrete Extractor, configurable with a constructor parameter
        public static class PrefixExtractor implements Extractor {
            private int endIndex;

            public PrefixExtractor(int endIndex) {
                this.endIndex = endIndex;
            }

            public String extract(String s) {
                return s.substring(0, this.endIndex);
            }
        }

        public static class Map extends Mapper<Object, Text, Text, Text> {
            private Extractor extractor;

            // Constructor configures the extractor
            public Map(Extractor extractor) {
                this.extractor = extractor;
            }

            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                String feature = extractor.extract(value.toString());
                context.write(new Text(feature), new Text(value.toString()));
            }
        }

        public static class Reduce extends Reducer<Text, Text, Text, Text> {
            public void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                for (Text val : values)
                    context.write(key, val);
            }
        }

        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            Job job = new Job(conf, "test");
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            job.setMapperClass(Map.class);
            job.setReducerClass(Reduce.class);
            job.setInputFormatClass(TextInputFormat.class);
            job.setOutputFormatClass(TextOutputFormat.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            job.waitForCompletion(true);
        }
    }
As should be clear, since the Mapper is handed to the Job only as a class reference (Map.class), Hadoop has no way to pass a constructor argument, and so no way to configure a specific Extractor.
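The reason is that Hadoop instantiates the Mapper reflectively from the class reference (via ReflectionUtils.newInstance), which requires a no-argument constructor. A simplified, self-contained sketch of that mechanism (my own names, not Hadoop's actual code):

```java
import java.lang.reflect.Constructor;

// Simplified sketch of how Hadoop creates the Mapper from a class reference:
// it looks up the no-argument constructor, so a class that only has a
// parameterized constructor (like Map(Extractor) above) cannot be created.
public class ReflectionSketch {

    // Stand-in for the Map(Extractor) mapper: no no-arg constructor.
    static class NeedsArg {
        NeedsArg(int n) { }
    }

    static <T> T newInstance(Class<T> cls) throws Exception {
        Constructor<T> ctor = cls.getDeclaredConstructor(); // needs a no-arg ctor
        ctor.setAccessible(true);
        return ctor.newInstance();
    }

    public static void main(String[] args) throws Exception {
        try {
            newInstance(NeedsArg.class);
        } catch (NoSuchMethodException e) {
            // This is essentially the failure Hadoop would hit with Map(Extractor)
            System.out.println("no no-arg constructor available");
        }
    }
}
```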
There are Hadoop-wrapping frameworks like Scoobi, Crunch, and Scrunch (and probably many others that I don't know about) that seem to have this ability, but I don't know how they do it. EDIT: After working with Scoobi some more, I found that I was partially mistaken. If you use an external object inside a "mapper", Scoobi requires it to be Serializable, and will complain at runtime if it is not. So maybe the right approach is to serialize the Extractor, ship the serialized form with the job, and deserialize it in the Mapper's setup method...
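That serialize-and-deserialize idea can be sketched in plain Java: serialize the configured Extractor to a Base64 string, which could then be stored under some Configuration key in main (e.g. conf.set("extractor", encoded)) and read back in setup(). The helper names and key here are my own, just for illustration:

```java
import java.io.*;
import java.util.Base64;

public class SerializedExtractor {

    // Extractor must be Serializable for this approach to work at all
    public interface Extractor extends Serializable {
        String extract(String s);
    }

    public static class PrefixExtractor implements Extractor {
        private final int endIndex;
        public PrefixExtractor(int endIndex) { this.endIndex = endIndex; }
        public String extract(String s) { return s.substring(0, endIndex); }
    }

    // Would run in main: turn the configured Extractor into a String that
    // can be stored in the job Configuration under an arbitrary key.
    public static String encode(Extractor e) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(e);
        }
        return Base64.getEncoder().encodeToString(bytes.toByteArray());
    }

    // Would run in Mapper.setup(): read the string back from the
    // Configuration and reconstruct the configured Extractor instance.
    public static Extractor decode(String encoded)
            throws IOException, ClassNotFoundException {
        byte[] bytes = Base64.getDecoder().decode(encoded);
        try (ObjectInputStream in =
                new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return (Extractor) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        String encoded = encode(new PrefixExtractor(3));
        Extractor restored = decode(encoded);
        System.out.println(restored.extract("hadoop"));
    }
}
```

The round trip preserves the constructor argument (endIndex), which is exactly what a class-reference-only Mapper loses.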
Also, I actually work in Scala, so Scala-based solutions are certainly welcome (if not encouraged!)