Serialization Issues Using Functional Implementations Using Spark

Question

I'm having trouble implementing Spark functions in Java. The documentation describes three ways to pass functions to map and reduce:

1. through lambdas
2. through inline (anonymous) classes that implement Function and Function2
3. through inner classes implementing Function and Function2

The problem is that I can't get 2. and 3. to work. For example, this code:

    public int countInline(String path) {
        String master = "local";
        SparkConf conf = new SparkConf().setAppName("charCounterInLine")
                .setMaster(master);
        JavaSparkContext sc = new JavaSparkContext(conf);
        JavaRDD<String> lines = sc.textFile(path);
        JavaRDD<Integer> lineLengths = lines
                .map(new Function<String, Integer>() {
                    public Integer call(String s) {
                        return s.length();
                    }
                });
        return lineLengths.reduce(new Function2<Integer, Integer, Integer>() {
            public Integer call(Integer a, Integer b) {
                return a + b;
            }
        }); // the line causing the error
    }

gives me this error:

    14/07/09 11:23:20 INFO DAGScheduler: Failed to run reduce at CharCounter.java:42
    [WARNING]
    java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at org.codehaus.mojo.exec.ExecJavaMojo$1.run(ExecJavaMojo.java:297)
        at java.lang.Thread.run(Thread.java:745)
    Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: Hadoop.Spark.basique.CharCounter
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
        at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
        at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
        at akka.actor.ActorCell.invoke(ActorCell.scala:456)
        at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
        at akka.dispatch.Mailbox.run(Mailbox.scala:219)
        at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
        at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
        at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
        at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
        at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

I can currently avoid this problem by implementing Function and Function2 in public outer classes. However, this was more of a lucky guess than a well-understood solution. Moreover, since I cannot get the examples from the documentation to work, I think there is something I don't understand.

In conclusion, my questions are:

How can I make 2. and 3. work?
Why does only the lambda work?
Is there any other way to use functions?

+6

java apache-spark

fxm Jul 9 '14 at 9:38

2 answers

Adding "implements Serializable" to the enclosing class may solve the problem. Spark tries to serialize the enclosing class because the inner class holds a reference to it, but the enclosing class is not serializable.
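A minimal sketch of this suggestion, using plain Java serialization rather than Spark so that it runs on its own. The names `SerializableCounter`, `SumFunction`, and `roundTrip` are made up for illustration; `SumFunction` is only a stand-in for Spark's `Function2`, which is likewise `Serializable`.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical enclosing class, made Serializable as the answer suggests.
class SerializableCounter implements Serializable {
    private static final long serialVersionUID = 1L;

    // Anonymous inner class: it keeps a hidden reference to this
    // SerializableCounter, which now can be serialized along with it.
    SumFunction<Integer, Integer, Integer> sum() {
        return new SumFunction<Integer, Integer, Integer>() {
            @Override
            public Integer call(Integer a, Integer b) {
                return a + b;
            }
        };
    }

    // Serializes the value to bytes and reads it back, roughly what happens
    // when a function object is shipped to another JVM.
    @SuppressWarnings("unchecked")
    static <T> T roundTrip(T value) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(value);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (T) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException("serialization round trip failed", e);
        }
    }

    public static void main(String[] args) {
        SumFunction<Integer, Integer, Integer> copy = roundTrip(new SerializableCounter().sum());
        System.out.println(copy.call(2, 3));
    }
}

// Hypothetical stand-in for Spark's Function2 interface.
interface SumFunction<A, B, R> extends Serializable {
    R call(A a, B b);
}
```

Note that with this approach the whole enclosing object travels with the function, so every non-transient field of the enclosing class has to be serializable as well.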

+1

Pushpender Sep 28 '16 at 13:03

Josh Rosen · Accepted Answer · 2014-07-10T20:01:18+0000

The relevant part of this stack trace is:

 Task not serializable: java.io.NotSerializableException: Hadoop.Spark.basique.CharCounter

When you define your functions as inner classes, their enclosing object is pulled into the function's closure and serialized along with it. If that enclosing class is not serializable, or contains a non-serializable field, you will encounter this error.
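The effect can be reproduced without Spark. The following is a small sketch using plain Java serialization; `ClosureDemo`, `SerFunction`, and `trySerialize` are hypothetical names, with `SerFunction` standing in for Spark's `Function` interface. It also suggests why the lambda version works: a lambda that does not touch `this` or any instance member does not capture the enclosing instance.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Deliberately NOT Serializable, like CharCounter in the question.
class ClosureDemo {

    // Anonymous inner class created in an instance method: it holds a hidden
    // reference to the enclosing ClosureDemo instance.
    SerFunction<String, Integer> anonymousLength() {
        return new SerFunction<String, Integer>() {
            @Override
            public Integer call(String s) {
                return s.length();
            }
        };
    }

    // Lambda that uses neither `this` nor any instance member, so the
    // enclosing instance is not captured.
    SerFunction<String, Integer> lambdaLength() {
        return s -> s.length();
    }

    // Returns null on success, or the exception message (the name of the
    // class that could not be serialized) on failure.
    static String trySerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return null;
        } catch (NotSerializableException e) {
            return e.getMessage();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        ClosureDemo demo = new ClosureDemo();
        System.out.println("anonymous class failed on: " + trySerialize(demo.anonymousLength()));
        System.out.println("lambda failed on: " + trySerialize(demo.lambdaLength()));
    }
}

// Hypothetical stand-in for Spark's Function interface, which is also Serializable.
interface SerFunction<T, R> extends Serializable {
    R call(T t);
}
```

The anonymous class should fail with a `NotSerializableException` naming the enclosing class, just as the Spark trace names `CharCounter`, while the lambda should serialize without complaint.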

You have several options:

Mark the non-serializable fields of the object as transient.
Define your functions as outer classes.
Define your functions as static nested classes.
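Two of these options can be sketched with plain Java serialization, again without Spark. All names here (`OptionsMain`, `StaticNestedDemo`, `TransientFieldDemo`, `MapFunction`, `canSerialize`) are hypothetical, and `MapFunction` is only a stand-in for Spark's `Function`. The transient option assumes the enclosing class itself is declared `Serializable`.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectOutputStream;
import java.io.Serializable;

class OptionsMain {
    // True if the object can be written with standard Java serialization.
    static boolean canSerialize(Object o) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(o);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println(canSerialize(new StaticNestedDemo().length()));
        System.out.println(canSerialize(new TransientFieldDemo().length()));
    }
}

// Hypothetical stand-in for Spark's Function interface.
interface MapFunction<T, R> extends Serializable {
    R call(T t);
}

// Not Serializable, and it does not need to be.
class StaticNestedDemo {
    // Static nested class: no hidden reference to a StaticNestedDemo instance.
    static class Length implements MapFunction<String, Integer> {
        private static final long serialVersionUID = 1L;

        @Override
        public Integer call(String s) {
            return s.length();
        }
    }

    MapFunction<String, Integer> length() {
        return new Length();
    }
}

// Serializable enclosing class whose only non-serializable field is transient.
class TransientFieldDemo implements Serializable {
    private static final long serialVersionUID = 1L;

    // Thread is not Serializable; transient tells serialization to skip it.
    private final transient Thread helper = new Thread(() -> { });

    // Anonymous inner class: the enclosing instance is serialized with it,
    // minus the transient field.
    MapFunction<String, Integer> length() {
        return new MapFunction<String, Integer>() {
            @Override
            public Integer call(String s) {
                return s.length();
            }
        };
    }
}
```

A transient field is not restored on the receiving side, so this option only fits fields the function does not need when it runs.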
