What we are doing:
- Installing Spark 0.9.1 according to the documentation on the website, against a CDH4 Hadoop/HDFS distribution (and another cluster running CDH5).
- Building a fat jar containing our Spark application with sbt, then trying to run it on the cluster.
Code snippets and sbt dependencies are included below.
When I Google this, there seem to be a few somewhat vague answers: a) a mismatch between the Spark version on the nodes and the one compiled into the user code, b) more jars need to be added to the SparkConf.
I know (b) is not the problem here, because the same code runs successfully on other clusters while including only a single jar (the fat jar).
But I have no idea how to check (a): Spark does not seem to do any version checking. It would be nice if it checked versions and threw an "incompatible version" exception: "your user code is built against version X, but node Y is running version Z."
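One thing that might help at least halfway with (a) is reading the spark-core version from the jar manifest on the driver classpath and comparing it by hand with what is installed on the worker nodes. This is only a sketch, and whether the manifest actually carries an Implementation-Version depends on how the jar was packaged:

import org.apache.spark.SparkContext

object SparkVersionCheck {
  def main(args: Array[String]) {
    // Version of spark-core as recorded in the manifest of the jar that
    // provides SparkContext on the driver classpath; may be absent.
    val driverSparkVersion =
      Option(classOf[SparkContext].getPackage.getImplementationVersion)
        .getOrElse("unknown")
    println("spark-core on driver classpath: " + driverSparkVersion)
    // The version installed on each worker still has to be compared manually,
    // e.g. against the Spark assembly jar shipped on those nodes.
  }
}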
I would be very grateful for any advice on this. I have filed a bug report, because something seems to be wrong with the Spark documentation, given that I have seen two independent setups hit the same problem with different CDH versions on different clusters. https://issues.apache.org/jira/browse/SPARK-1867
The exception:
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task 0.0:1 failed 32 times (most recent failure: Exception failure: java.lang.IllegalStateException: unread block data)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1020)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1018)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$abortStage(DAGScheduler.scala:1018)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$processEvent$10.apply(DAGScheduler.scala:604)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.processEvent(DAGScheduler.scala:604)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$start$1$$anon$2$$anonfun$receive$1.applyOrElse(DAGScheduler.scala:190)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/05/16 18:05:31 INFO scheduler.TaskSetManager: Loss was due to java.lang.IllegalStateException: unread block data [duplicate 59]
My code snippet:
val conf = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(SparkContext.jarOfClass(this.getClass))

println("count = " + new SparkContext(conf).textFile(someHdfsPath).count())
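For completeness, this is roughly what (b) would look like if the fat jar (or any extra jars) had to be listed explicitly instead of being resolved via jarOfClass; the path below is made up purely for illustration:

// Hypothetical explicit jar list; in our case the fat jar is the only entry,
// which is exactly what SparkContext.jarOfClass(this.getClass) resolves to.
val confWithExplicitJars = new SparkConf()
  .setMaster(clusterMaster)
  .setAppName(appName)
  .setSparkHome(sparkHome)
  .setJars(Seq("/path/to/my-app-assembly-fat.jar"))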
My sbt dependencies:
// relevant
"org.apache.spark" % "spark-core_2.10" % "0.9.1",
"org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0",

// standard, probably unrelated
"com.github.seratch" %% "awscala" % "[0.2,)",
"org.scalacheck" %% "scalacheck" % "1.10.1" % "test",
"org.specs2" %% "specs2" % "1.14" % "test",
"org.scala-lang" % "scala-reflect" % "2.10.3",
"org.scalaz" %% "scalaz-core" % "7.0.5",
"net.minidev" % "json-smart" % "1.2"
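For reference, a rough sketch of the kind of sbt-assembly setup that produces the fat jar; the plugin version, merge strategy, and dependency list shown here are assumptions for illustration, not copied from our actual build:

// project/plugins.sbt  (plugin version is an assumption)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.11.2")

// build.sbt  (sketch)
import AssemblyKeys._

assemblySettings

scalaVersion := "2.10.3"

libraryDependencies ++= Seq(
  "org.apache.spark" % "spark-core_2.10" % "0.9.1",
  "org.apache.hadoop" % "hadoop-client" % "2.3.0-mr1-cdh5.0.0"
)

// Discard duplicate META-INF entries that would otherwise make the merge fail.
mergeStrategy in assembly <<= (mergeStrategy in assembly) { old =>
  {
    case PathList("META-INF", xs @ _*) => MergeStrategy.discard
    case x => old(x)
  }
}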