JSON handling using java Mapreduce

Question

JSON handling using java Mapreduce

I'm new to hasoop mapreduce

I have a text input file where the data was saved as follows. Here are just a few tuples (data.txt)

{"author":"Sharīf Qāsim","book":"al- Rabīʻ al-manshūd"} {"author":"Nāṣir Nimrī","book":"Adīb ʻAbbāsī"} {"author":"Muẓaffar ʻAbd al-Majīd Kammūnah","book":"Asmāʼ Allāh al-ḥusná al-wāridah fī muḥkam kitābih"} {"author":"Ḥasan Muṣṭafá Aḥmad","book":"al- Jabhah al-sharqīyah wa-maʻārikuhā fī ḥarb Ramaḍān"} {"author":"Rafīqah Salīm Ḥammūd","book":"Taʻlīm fī al-Baḥrayn"}

This is my java file that I have to write in my code (CombineBooks.java)

 package org.hwone; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.util.GenericOptionsParser; //TODO import necessary components /* * Modify this file to combine books from the same other into * single JSON object. * ie {"author": "Tobias Wells", "books": [{"book":"A die in the country"},{"book": "Dinky died"}]} * Beaware that, this may work on anynumber of nodes! * */ public class CombineBooks { //TODO define variables and implement necessary components public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); String[] otherArgs = new GenericOptionsParser(conf, args) .getRemainingArgs(); if (otherArgs.length != 2) { System.err.println("Usage: CombineBooks <in> <out>"); System.exit(2); } //TODO implement CombineBooks Job job = new Job(conf, "CombineBooks"); //TODO implement CombineBooks System.exit(job.waitForCompletion(true) ? 0 : 1); } }

My task is to create a Hadoop program in "CombineBooks.java" returns to the "question-2" directory. The program should do the following: considering the input tuples of the author’s book, reduce the map; the program should pass a JSON object that contains all the books of the same author in the JSON array, i.e.

 {"author": "Tobias Wells", "books":[{"book":"A die in the country"},{"book": "Dinky died"}]}

Any idea how this can be done?

+5

json mapreduce hadoop

abu Oct 30 '14 at 17:37

source share

2 answers

Refer to shared multi-line JSON: https://github.com/alexholmes/json-mapreduce

0

Sanjiv Jul 9 '15 at 9:28

source share

0x0FFF · Accepted Answer · 2014-11-01T15:16:08+0000

First, the JSON objects you are trying to work with are not available to you. To solve this problem:

Go here and download as zip: https://github.com/douglascrockford/JSON-java
Extract to sources folder in subdirectory org / json / *

Next, the first line of your code creates the package "org.json", which is incorrect, and you create a separate package, for example, "my.books".

Thirdly, using a combiner is useless here.

Here is the code I got into, it works and solves your problem:

 package my.books; import java.io.IOException; import org.apache.hadoop.conf.Configuration; import org.apache.hadoop.fs.Path; import org.apache.hadoop.io.LongWritable; import org.apache.hadoop.io.NullWritable; import org.apache.hadoop.io.Text; import org.apache.hadoop.mapreduce.Job; import org.apache.hadoop.mapreduce.Mapper; import org.apache.hadoop.mapreduce.Reducer; import org.apache.hadoop.mapreduce.lib.input.FileInputFormat; import org.apache.hadoop.mapreduce.lib.input.TextInputFormat; import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat; import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat; import org.apache.hadoop.util.GenericOptionsParser; import org.json.*; import javax.security.auth.callback.TextInputCallback; public class CombineBooks { public static class Map extends Mapper<LongWritable, Text, Text, Text>{ public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException{ String author; String book; String line = value.toString(); String[] tuple = line.split("\\n"); try{ for(int i=0;i<tuple.length; i++){ JSONObject obj = new JSONObject(tuple[i]); author = obj.getString("author"); book = obj.getString("book"); context.write(new Text(author), new Text(book)); } }catch(JSONException e){ e.printStackTrace(); } } } public static class Reduce extends Reducer<Text,Text,NullWritable,Text>{ public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException{ try{ JSONObject obj = new JSONObject(); JSONArray ja = new JSONArray(); for(Text val : values){ JSONObject jo = new JSONObject().put("book", val.toString()); ja.put(jo); } obj.put("books", ja); obj.put("author", key.toString()); context.write(NullWritable.get(), new Text(obj.toString())); }catch(JSONException e){ e.printStackTrace(); } } } public static void main(String[] args) throws Exception { Configuration conf = new Configuration(); if (args.length != 2) { System.err.println("Usage: CombineBooks <in> <out>"); System.exit(2); } Job job = new Job(conf, "CombineBooks"); job.setJarByClass(CombineBooks.class); job.setMapperClass(Map.class); job.setReducerClass(Reduce.class); job.setMapOutputKeyClass(Text.class); job.setMapOutputValueClass(Text.class); job.setOutputKeyClass(NullWritable.class); job.setOutputValueClass(Text.class); job.setInputFormatClass(TextInputFormat.class); job.setOutputFormatClass(TextOutputFormat.class); FileInputFormat.addInputPath(job, new Path(args[0])); FileOutputFormat.setOutputPath(job, new Path(args[1])); System.exit(job.waitForCompletion(true) ? 0 : 1); } }

Here is the folder structure of my project:

 src src/my src/my/books src/my/books/CombineBooks.java src/org src/org/json src/org/json/zip src/org/json/zip/BitReader.java ... src/org/json/zip/None.java src/org/json/JSONStringer.java src/org/json/JSONML.java ... src/org/json/JSONException.java

Here is the entrance

 [localhost:CombineBooks]$ hdfs dfs -cat /example.txt {"author":"author1", "book":"book1"} {"author":"author1", "book":"book2"} {"author":"author1", "book":"book3"} {"author":"author2", "book":"book4"} {"author":"author2", "book":"book5"} {"author":"author3", "book":"book6"}

Command to run:

 hadoop jar ./bookparse.jar my.books.CombineBooks /example.txt /test_output

Here's the conclusion:

 [pivhdsne:CombineBooks]$ hdfs dfs -cat /test_output/part-r-00000 {"books":[{"book":"book3"},{"book":"book2"},{"book":"book1"}],"author":"author1"} {"books":[{"book":"book5"},{"book":"book4"}],"author":"author2"} {"books":[{"book":"book6"}],"author":"author3"}

You can use three parameters to place the org.json.* Classes in the cluster:

Pack the org.json.* Classes in your jar file (this can be done using the GUI IDE). This is the option I used in my answer.
Place the jar file containing the org.json.* Classes on each cluster node in one of the CLASSPATH directories (see .application.classpath yarn)
Place the jar file containing org.json.* In HDFS ( hdfs dfs -put <org.json jar> <hdfs path> ) and use the job.addFileToClassPath call so that this jar file is available for all tasks that do your job in the cluster. In my answer you should add job.addFileToClassPath(new Path("<jar_file_on_hdfs_location>")); in main

JSON handling using java Mapreduce

More articles: