Parsing JSON input in Java

My input is in HDFS. I'm just trying to do a word count, but with a slight difference: the data is in JSON format, so each row of input looks like this:

{"author":"foo", "text": "hello"} {"author":"foo123", "text": "hello world"} {"author":"foo234", "text": "hello this world"} 

I only want to count the words in the "text" field.

How can I do that?

I tried the following option:

    public static class TokenCounterMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private static final Log log = LogFactory.getLog(TokenCounterMapper.class);
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            try {
                JSONObject jsn = new JSONObject(value.toString());
                //StringTokenizer itr = new StringTokenizer(value.toString());
                String text = (String) jsn.get("text");
                log.info("Logging data");
                log.info(text);
                StringTokenizer itr = new StringTokenizer(text);
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, one);
                }
            } catch (JSONException e) {
                // TODO Auto-generated catch block
                e.printStackTrace();
            }
        }
    }

But I get this error:

    Error: java.lang.ClassNotFoundException: org.json.JSONException
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:301)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        at java.lang.Class.forName0(Native Method)
        at java.lang.Class.forName(Class.java:247)
        at org.apache.hadoop.conf.Configuration.getClassByName(Configuration.java:820)
        at org.apache.hadoop.conf.Configuration.getClass(Configuration.java:865)
        at org.apache.hadoop.mapreduce.JobContext.getMapperClass(JobContext.java:199)
        at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:719)
        at org.apache.hadoop.mapred.MapTask.run(MapTask.java:370)
        at org.apache.hadoop.mapred.Child$4.run(Child.java:255)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:396)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1093)
        at org.apache.hadoop.mapred.Child.main(Child.java:249)
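Independent of the classpath problem in the stack trace, the counting logic itself can be checked outside Hadoop with plain java.util classes. The `extractText` helper below is a naive, hypothetical stand-in for a real JSON parser (a real job should keep using org.json or similar), added only so the sketch is self-contained:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

public class TextWordCount {

    // Naive extraction of the "text" field by string slicing.
    // Illustration only: a real mapper should use a JSON library.
    static String extractText(String jsonLine) {
        String key = "\"text\":";
        int i = jsonLine.indexOf(key);
        if (i < 0) return "";
        int start = jsonLine.indexOf('"', i + key.length());
        int end = jsonLine.indexOf('"', start + 1);
        return jsonLine.substring(start + 1, end);
    }

    // Same tokenize-and-count logic as the mapper, accumulated locally.
    public static Map<String, Integer> count(String[] lines) {
        Map<String, Integer> counts = new HashMap<>();
        for (String line : lines) {
            StringTokenizer itr = new StringTokenizer(extractText(line));
            while (itr.hasMoreTokens()) {
                counts.merge(itr.nextToken(), 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String[] rows = {
            "{\"author\":\"foo\", \"text\": \"hello\"}",
            "{\"author\":\"foo123\", \"text\": \"hello world\"}",
            "{\"author\":\"foo234\", \"text\": \"hello this world\"}"
        };
        System.out.println(count(rows));
    }
}
```

Running this on the three sample rows counts only the words of the "text" values, which is the behavior the mapper is supposed to have once the JSON library is on the task classpath.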
2 answers

It looks like you forgot to bundle the JSON library into your Hadoop job JAR. You can see how to build your job with the library here: http://tikalk.com/build-your-first-hadoop-project-maven
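With Maven, one way to do this is to declare org.json as a dependency and package it into the job JAR (for example with the shade or assembly plugin). The version below is only an example; use whatever matches your build:

```xml
<dependency>
    <groupId>org.json</groupId>
    <artifactId>json</artifactId>
    <!-- example version; pick the one your project actually needs -->
    <version>20090211</version>
</dependency>
```

Without a step like this, the library exists at compile time but is missing on the task nodes, which is exactly what the `ClassNotFoundException: org.json.JSONException` says.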


There are several ways to use external JARs with your map reduce code:

  • Include the referenced JAR in the lib subdirectory of the job JAR you submit: the job will unpack the JAR from that lib subdirectory into the job's working directory on the appropriate TaskTracker nodes, making it available to your code. If the JARs are small, change often, and are specific to one job, this is the preferred method. This is what @clement suggested in his answer.

  • Install the JAR on the cluster nodes. The easiest way is to place the JAR in the $HADOOP_HOME/lib directory, since everything in this directory is put on the classpath when the Hadoop daemons start. Note that a restart of the daemons is required for this to take effect.

  • Make the TaskTrackers use the external JAR by editing the HADOOP_TASKTRACKER_OPTS variable in the hadoop-env.sh configuration file and pointing it at the JAR. The JAR must be present at the same path on every node where a TaskTracker runs.

  • Include the JAR via the -libjars option on the hadoop jar … command line. The JAR will be placed in the distributed cache and made available to all of the job's task attempts. Your map reduce driver must use GenericOptionsParser. Read this blog post for more details.
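As a sketch, an invocation using -libjars could look like this (JAR names, driver class, and paths are placeholders, not values from the question):

```
hadoop jar wordcount.jar com.example.WordCountJob \
    -libjars /local/path/to/json.jar \
    /input/tweets /output/wordcount
```

This only has an effect if the driver passes its arguments through GenericOptionsParser, e.g. by implementing Tool and launching via ToolRunner; otherwise -libjars is silently treated as an ordinary application argument.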

Comparison:

  • #1 is a legacy method; it is discouraged because unpacking the JAR carries a significant performance penalty.

  • #2 and #3 are fine for private clusters, but pretty lame practice, as you cannot expect end users to do this.

  • #4 is the recommended option.

Read the original post from Cloudera for more details.


Source: https://habr.com/ru/post/946133/

