The best way to achieve this is through secondary sorting. You need to sort both the keys (the numbers, in your case) and the values (the file names, in your case). In Hadoop, the mapper's output is sorted by key only.
This can be achieved with a composite key: a key that is a combination of both the number and the file name. E.g. for the first record, the key will be (23, fileA) instead of just (23).
You can read about secondary sorting here: https://www.safaribooksonline.com/library/view/data-algorithms/9781491906170/ch01.html
You can also refer to the "Secondary Sort" section in Hadoop: The Definitive Guide.
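For illustration, a composite key for this case might look roughly like the sketch below. The class name NumberFileKey and its layout are my own invention, not something from the question or from Hadoop itself; a full secondary sort would additionally need a partitioner and a grouping comparator (a sketch of those follows the reducer logic further down).

import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;

public class NumberFileKey implements WritableComparable<NumberFileKey> {

    private final IntWritable number = new IntWritable();
    private final Text fileName = new Text();

    public void set(int number, String fileName) {
        this.number.set(number);
        this.fileName.set(fileName);
    }

    public IntWritable getNumber() { return number; }

    public Text getFileName() { return fileName; }

    @Override
    public void write(DataOutput out) throws IOException {
        number.write(out);
        fileName.write(out);
    }

    @Override
    public void readFields(DataInput in) throws IOException {
        number.readFields(in);
        fileName.readFields(in);
    }

    @Override
    public int compareTo(NumberFileKey other) {
        // Sort by file name first, then by number, so the shuffle delivers
        // the numbers of each file to the reducer in sorted order.
        int cmp = fileName.compareTo(other.fileName);
        return (cmp != 0) ? cmp : number.compareTo(other.number);
    }

    @Override
    public int hashCode() {
        // Based on the file name only, so the default HashPartitioner keeps
        // all records of one file together (a custom partitioner is safer).
        return fileName.hashCode();
    }

    @Override
    public boolean equals(Object o) {
        if (!(o instanceof NumberFileKey)) {
            return false;
        }
        NumberFileKey other = (NumberFileKey) o;
        return fileName.equals(other.fileName) && number.equals(other.number);
    }
}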
For simplicity, I wrote a program that achieves the same result.
In this program, the map output keys are sorted by the framework by default. I wrote the logic to sort the values on the reducer side, so the program takes care of sorting both keys and values and produces the desired output.
The following is the program:
package com.myorg.hadooptests;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.*;

public class SortedValue {

    public static class SortedValueMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

        public void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Each input line looks like "<number> <file name>".
            String[] tokens = value.toString().split(" ");
            if (tokens.length == 2) {
                // Emit (file name, number) so that records are grouped by file name.
                context.write(new Text(tokens[1]), new IntWritable(Integer.parseInt(tokens[0])));
            }
        }
    }

    public static class SortedValueReducer extends Reducer<Text, IntWritable, IntWritable, Text> {

        Map<String, ArrayList<Integer>> valueMap = new HashMap<String, ArrayList<Integer>>();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            String keyStr = key.toString();
            ArrayList<Integer> storedValues = valueMap.get(keyStr);

            // Collect all values for this key (file name).
            for (IntWritable value : values) {
                if (storedValues == null) {
                    storedValues = new ArrayList<Integer>();
                    valueMap.put(keyStr, storedValues);
                }
                storedValues.add(value.get());
            }

            // Sort the values and emit (number, file name) pairs in sorted order.
            Collections.sort(storedValues);
            for (Integer val : storedValues) {
                context.write(new IntWritable(val), key);
            }
        }
    }

    public static void main(String[] args) throws Exception {

        Configuration conf = new Configuration();

        Job job = Job.getInstance(conf, "CompositeKeyExample");
        job.setJarByClass(SortedValue.class);
        job.setMapperClass(SortedValueMapper.class);
        job.setReducerClass(SortedValueReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(IntWritable.class);
        job.setOutputValueClass(Text.class);

        FileInputFormat.addInputPath(job, new Path("/in/in1.txt"));
        FileOutputFormat.setOutputPath(job, new Path("/out/"));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
Mapper logic:
- Parses each line, assuming the number and the file name are separated by a space (" ").
- If the line contains 2 tokens, it emits (file name, integer value). E.g. for the first record, it emits (fileA, 23).
Reducer logic:
- It puts the (key, value) pairs into a HashMap, where the key is the file name and the value is the list of integers for that file. E.g. for fileA, the stored values will be 23, 34 and 35.
- Finally, it sorts the values for each key and, for every value, emits (value, key) from the reducer. E.g. for fileA, the output is: (23, fileA), (34, fileA) and (35, fileA).
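If you prefer the composite-key route described at the top instead of buffering and sorting the values in the reducer, you would also register a partitioner and a grouping comparator along these lines. Again, this is only a sketch built on the hypothetical NumberFileKey above, and it assumes the map output value would be a NullWritable once the key carries both fields:

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Sends every record of a given file to the same reducer, regardless of the number.
class FileNamePartitioner extends Partitioner<NumberFileKey, NullWritable> {
    @Override
    public int getPartition(NumberFileKey key, NullWritable value, int numPartitions) {
        return (key.getFileName().hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

// Treats all keys with the same file name as one group, so a single reduce()
// call sees the numbers of one file, already sorted by the key's compareTo().
class FileNameGroupingComparator extends WritableComparator {
    protected FileNameGroupingComparator() {
        super(NumberFileKey.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        return ((NumberFileKey) a).getFileName().compareTo(((NumberFileKey) b).getFileName());
    }
}

In the driver you would then register them, e.g.:

job.setPartitionerClass(FileNamePartitioner.class);
job.setGroupingComparatorClass(FileNameGroupingComparator.class);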
I ran this program for the following input:
34 fileB
35 fileA
60 fileC
60 fileA
23 fileA
I got the following output:
23 fileA
35 fileA
60 fileA
34 fileB
60 fileC
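For reference, a job like this is normally packaged into a jar and launched with the standard hadoop jar command, e.g. (the jar name below is just a placeholder):

hadoop jar hadooptests.jar com.myorg.hadooptests.SortedValue

Since the input and output paths are hardcoded in main(), /in/in1.txt must already exist on HDFS and /out/ must not exist before the run.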