PySpark crashes when I run the `first` or `take` method on Windows 7

I just ran these commands:

>>> lines = sc.textFile("C:\Users\elqstux\Desktop\dtop.txt")
>>> lines.count()   # this works fine
>>> lines.first()   # this crashes
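
For reference, here is the same session with the Windows path written as a raw string so the backslashes are taken literally. This is only a defensive-style sketch using the path and variable name from the question; it is probably unrelated to the crash itself, since `count()` already reads the file successfully.

    >>> # same commands, with the path as a raw string so "\U", "\D" etc. cannot be
    >>> # misread as escape sequences (a style precaution only)
    >>> lines = sc.textFile(r"C:\Users\elqstux\Desktop\dtop.txt")
    >>> lines.count()
    >>> lines.first()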

Here is the error report:

>>> lines.first()
15/11/18 17:33:35 INFO SparkContext: Starting job: runJob at PythonRDD.scala:393
15/11/18 17:33:35 INFO DAGScheduler: Got job 21 (runJob at PythonRDD.scala:393) with 1 output partitions
15/11/18 17:33:35 INFO DAGScheduler: Final stage: ResultStage 21(runJob at PythonRDD.scala:393)
15/11/18 17:33:35 INFO DAGScheduler: Parents of final stage: List()
15/11/18 17:33:35 INFO DAGScheduler: Missing parents: List()
15/11/18 17:33:35 INFO DAGScheduler: Submitting ResultStage 21 (PythonRDD[28] at RDD at PythonRDD.scala:43), which has no missing parents
15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(4824) called with curMem=619446, maxMem=555755765
15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24 stored as values in memory (estimated size 4.7 KB, free 529.4 MB)
15/11/18 17:33:35 INFO MemoryStore: ensureFreeSpace(3067) called with curMem=624270, maxMem=555755765
15/11/18 17:33:35 INFO MemoryStore: Block broadcast_24_piece0 stored as bytes in memory (estimated size 3.0 KB, free 529.4 MB)
15/11/18 17:33:35 INFO BlockManagerInfo: Added broadcast_24_piece0 in memory on localhost:55487 (size: 3.0 KB, free: 529.9 MB)
15/11/18 17:33:35 INFO SparkContext: Created broadcast 24 from broadcast at DAGScheduler.scala:861
15/11/18 17:33:35 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 21 (PythonRDD[28] at RDD at PythonRDD.scala:43)
15/11/18 17:33:35 INFO TaskSchedulerImpl: Adding task set 21.0 with 1 tasks
15/11/18 17:33:35 INFO TaskSetManager: Starting task 0.0 in stage 21.0 (TID 33, localhost, PROCESS_LOCAL, 2148 bytes)
15/11/18 17:33:35 INFO Executor: Running task 0.0 in stage 21.0 (TID 33)
15/11/18 17:33:35 INFO HadoopRDD: Input split: file:/C:/Users/elqstux/Desktop/dtop.txt:0+112852
15/11/18 17:33:36 INFO PythonRunner: Times: total = 629, boot = 626, init = 3, finish = 0
15/11/18 17:33:36 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
15/11/18 17:33:36 ERROR PythonRunner: This may have been caused by a prior exception:
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
15/11/18 17:33:36 ERROR Executor: Exception in task 0.0 in stage 21.0 (TID 33)
java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
15/11/18 17:33:36 WARN TaskSetManager: Lost task 0.0 in stage 21.0 (TID 33, localhost): java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
15/11/18 17:33:36 ERROR TaskSetManager: Task 0 in stage 21.0 failed 1 times; aborting job
15/11/18 17:33:36 INFO TaskSchedulerImpl: Removed TaskSet 21.0, whose tasks have all completed, from pool
15/11/18 17:33:36 INFO TaskSchedulerImpl: Cancelling stage 21
15/11/18 17:33:36 INFO DAGScheduler: ResultStage 21 (runJob at PythonRDD.scala:393) failed in 0.759 s
15/11/18 17:33:36 INFO DAGScheduler: Job 21 failed: runJob at PythonRDD.scala:393, took 0.810138 s
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1317, in first
    rs = self.take(1)
  File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\rdd.py", line 1299, in take
    res = self.context.runJob(self, takeUpToNumLeft, p)
  File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\context.py", line 916, in runJob
    port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
  File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\java_gateway.py", line 538, in __call__
  File "c:\spark-1.5.2-bin-hadoop2.6\python\pyspark\sql\utils.py", line 36, in deco
    return f(*a, **kw)
  File "c:\spark-1.5.2-bin-hadoop2.6\python\lib\py4j-0.8.2.1-src.zip\py4j\protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 21.0 failed 1 times, most recent failure: Lost task 0.0 in stage 21.0 (TID 33, localhost): java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
Driver stacktrace:
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
        at scala.Option.foreach(Option.scala:236)
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
        at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1824)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1837)
        at org.apache.spark.SparkContext.runJob(SparkContext.scala:1850)
        at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
        at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:483)
        at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
        at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
        at py4j.Gateway.invoke(Gateway.java:259)
        at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
        at py4j.commands.CallCommand.execute(CallCommand.java:79)
        at py4j.GatewayConnection.run(GatewayConnection.java:207)
        at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.SocketException: Connection reset by peer: socket write error
        at java.net.SocketOutputStream.socketWrite0(Native Method)
        at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:109)
        at java.net.SocketOutputStream.write(SocketOutputStream.java:153)
        at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
        at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
        at java.io.DataOutputStream.flush(DataOutputStream.java:123)
        at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:283)
        at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1699)
        at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)

When I run `take`, it also crashes. I can't find the reason; can anyone help?

1 answer

I was stuck for hours with the same issue on Windows 7 with Spark 1.5.0 (Python 2.7.11). In the end I simply switched to Unix with exactly the same build. It's not an elegant solution, but I couldn't find any other way around it.
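
As a rough illustration of that workaround, the same session with the identical spark-1.5.2-bin-hadoop2.6 build, started from bin/pyspark on a Linux machine, would look like this. The file path below is hypothetical and assumes the text file has been copied over.

    >>> lines = sc.textFile("/home/elqstux/dtop.txt")   # hypothetical Unix path to the same file
    >>> lines.count()
    >>> lines.first()   # per this answer, it no longer crashes on Unix with the same build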


Source: https://habr.com/ru/post/1236217/

