I just built Spark (using sbt) on a Windows 7 machine and went through the quick start. The Spark job fails when calling first(): count() on the same RDD succeeds, but first() aborts with the exception below.
I am new to Java and have no clear idea of what the stack trace shows, although it seems to be related to a java.net.SocketException raised while the processes are messaging each other. Note that I am not using a Hadoop installation. Also note that the same example runs without errors in Scala.
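For completeness, here are the same steps as a standalone script (a minimal sketch of the IPython session below; the SparkContext setup is my addition, since in the pyspark shell sc already exists):

    from pyspark import SparkContext

    # In the pyspark shell, sc is created for you; this is only needed standalone.
    sc = SparkContext("local", "bz2-first-repro")

    path = u'C:\\Users\\striji\\Documents\\Personal\\python\\pyspark-flights\\2001.csv.bz2'
    textFile = sc.textFile(path)

    print textFile.count()  # succeeds: 5967781
    print textFile.first()  # fails with java.net.SocketException: Connection reset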
Environment:
Windows 7
Spark 1.2.1
Anaconda Python 2.7.8
Scala 2.10.4
sbt 0.13.7
jdk 1.7.0.75
In [2]: path = u'C:\\Users\\striji\\Documents\\Personal\\python\\pyspark-flights\\2001.csv.bz2'

In [3]: textFile = sc.textFile(path)

In [4]: textFile
Out[4]: C:\Users\striji\Documents\Personal\python\pyspark-flights\2001.csv.bz2 MappedRDD[1] at textFile at NativeMethodAccessorImpl.java:-2

In [5]: textFile.count()
...
Out[5]: 5967781

In [6]: textFile.first()
15/02/19 08:52:01 INFO SparkContext: Starting job: runJob at PythonRDD.scala:344
15/02/19 08:52:01 INFO DAGScheduler: Got job 1 (runJob at PythonRDD.scala:344) with 1 output partitions (allowLocal=true)
15/02/19 08:52:01 INFO DAGScheduler: Final stage: Stage 1(runJob at PythonRDD.scala:344)
15/02/19 08:52:01 INFO DAGScheduler: Parents of final stage: List()
15/02/19 08:52:01 INFO DAGScheduler: Missing parents: List()
15/02/19 08:52:01 INFO DAGScheduler: Submitting Stage 1 (PythonRDD[3] at RDD at PythonRDD.scala:43), which has no missing parents
15/02/19 08:52:01 INFO MemoryStore: ensureFreeSpace(4560) called with curMem=46832, maxMem=278302556
15/02/19 08:52:01 INFO MemoryStore: Block broadcast_2 stored as values in memory (estimated size 4.5 KB, free 265.4 MB)
15/02/19 08:52:01 INFO MemoryStore: ensureFreeSpace(3417) called with curMem=51392, maxMem=278302556
15/02/19 08:52:01 INFO MemoryStore: Block broadcast_2_piece0 stored as bytes in memory (estimated size 3.3 KB, free 265.4 MB)
15/02/19 08:52:01 INFO BlockManagerInfo: Added broadcast_2_piece0 in memory on localhost:51106 (size: 3.3 KB, free: 265.4 MB)
15/02/19 08:52:01 INFO BlockManagerMaster: Updated info of block broadcast_2_piece0
15/02/19 08:52:01 INFO SparkContext: Created broadcast 2 from broadcast at DAGScheduler.scala:838
15/02/19 08:52:01 INFO DAGScheduler: Submitting 1 missing tasks from Stage 1 (PythonRDD[3] at RDD at PythonRDD.scala:43)
15/02/19 08:52:01 INFO TaskSchedulerImpl: Adding task set 1.0 with 1 tasks
15/02/19 08:52:01 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 1, localhost, PROCESS_LOCAL, 1341 bytes)
15/02/19 08:52:01 INFO Executor: Running task 0.0 in stage 1.0 (TID 1)
15/02/19 08:52:04 INFO HadoopRDD: Input split: file:/C:/Users/striji/Documents/Personal/python/pyspark-flights/2001.csv.bz2:0+83478700
15/02/19 08:52:04 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 1)
java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:196)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/02/19 08:52:05 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 1, localhost): java.net.SocketException: Connection reset
    at java.net.SocketInputStream.read(SocketInputStream.java:196)
    at java.net.SocketInputStream.read(SocketInputStream.java:122)
    at java.io.BufferedInputStream.fill(BufferedInputStream.java:235)
    at java.io.BufferedInputStream.read(BufferedInputStream.java:254)
    at java.io.DataInputStream.readInt(DataInputStream.java:387)
    at org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:110)
    at org.apache.spark.api.python.PythonRDD$$anon$1.<init>(PythonRDD.scala:174)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:96)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:280)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:247)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
    at org.apache.spark.scheduler.Task.run(Task.scala:56)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:200)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:745)
15/02/19 08:52:05 ERROR TaskSetManager: Task 0 in stage 1.0 failed 1 times; aborting job
15/02/19 08:52:05 INFO TaskSchedulerImpl: Removed TaskSet 1.0, whose tasks have all completed, from pool
15/02/19 08:52:05 INFO TaskSchedulerImpl: Cancelling stage 1
15/02/19 08:52:05 INFO DAGScheduler: Job 1 failed: runJob at PythonRDD.scala:344, took 3.796728 s
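In case it helps with diagnosis: the exception is thrown while the JVM executor reads from the Python worker's socket (PythonRDD.scala:110 in the trace), so my guess is that the pyspark worker process dies or drops the connection while the split is being read. Two shorter checks I could run (my own diagnostic ideas, not part of the original session) would be:

    # The lambda must run in a Python worker process, so this exercises the
    # same PythonRDD.compute socket path that appears in the stack trace,
    # but with a small in-memory RDD instead of the compressed file.
    sc.parallelize(range(100)).map(lambda x: x * 2).first()

    # And to rule out a corrupt archive, read the file with plain Python:
    import bz2
    print bz2.BZ2File(path).readline()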