Spark Streaming from Kafka: error resulting in data loss

I have a Spark Streaming application written in Python that collects data from Kafka and stores it in the file system. When I run it, I see a lot of “holes” in the collected data. After analyzing the logs, I realized that 285,000 of the 302,000 jobs failed, all with the same exception:

Job aborted due to stage failure: Task 4 in stage 604348.0 failed 1 times, 
most recent failure: Lost task 4.0 in stage 604348.0 (TID 2097738, localhost): 
kafka.common.OffsetOutOfRangeException
    at sun.reflect.GeneratedConstructorAccessor26.newInstance(Unknown Source)
    at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
    at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
    at java.lang.Class.newInstance(Class.java:442)
    at kafka.common.ErrorMapping$.exceptionFor(ErrorMapping.scala:86)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.handleFetchErr(KafkaRDD.scala:184)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.fetchBatch(KafkaRDD.scala:193)
    at org.apache.spark.streaming.kafka.KafkaRDD$KafkaRDDIterator.getNext(KafkaRDD.scala:208)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at scala.collection.Iterator$class.foreach(Iterator.scala:727)
    at org.apache.spark.util.NextIterator.foreach(NextIterator.scala:21)
    at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:452)
    at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:280)
    at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1741)
    at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:239)
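For reference, the job is essentially a pipeline of this shape (a minimal sketch; the topic name, broker address, output path, and batch interval below are placeholders rather than my real values):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    sc = SparkContext(appName="kafka-to-files")
    ssc = StreamingContext(sc, 10)  # 10-second micro-batches (placeholder interval)

    # Direct stream via the Kafka 0.8 integration, which is what produces the
    # KafkaRDD frames in the trace above.
    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["my-topic"],                                   # placeholder topic
        kafkaParams={"metadata.broker.list": "broker1:9092"},  # placeholder broker
    )

    # Write each micro-batch to the file system; saveAsTextFiles creates one
    # output directory per batch, suffixed with the batch timestamp.
    stream.map(lambda kv: kv[1]).saveAsTextFiles("file:///data/kafka-dump/batch")

    ssc.start()
    ssc.awaitTermination()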

As far as I can tell, the data is being deleted from Kafka before my application manages to read it. The topic retention is 1 day, and whenever processing falls behind, the offsets the job requests have already been removed by Kafka. My question is how to fix this (preferably without losing data), or at least how to reliably detect when it happens.
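From what I have read, once this exception appears the records are already gone from the broker and cannot be recovered; the usual advice is either to raise the topic retention or to keep the job from lagging, for example with backpressure and a per-partition rate cap. A configuration sketch of the latter, assuming the same 0.8 direct-stream API as in the trace (the parameter values are illustrative, not my actual settings):

    from pyspark import SparkConf, SparkContext
    from pyspark.streaming import StreamingContext
    from pyspark.streaming.kafka import KafkaUtils

    conf = (SparkConf()
            .setAppName("kafka-to-files")
            # Let Spark adapt the ingestion rate to the measured processing speed.
            .set("spark.streaming.backpressure.enabled", "true")
            # Hard cap on records fetched per partition per second, as a safety net.
            .set("spark.streaming.kafka.maxRatePerPartition", "10000"))

    sc = SparkContext(conf=conf)
    ssc = StreamingContext(sc, 10)

    stream = KafkaUtils.createDirectStream(
        ssc,
        topics=["my-topic"],
        kafkaParams={
            "metadata.broker.list": "broker1:9092",
            # Only affects where a stream with no valid starting offsets begins;
            # it does not bring back data Kafka has already deleted.
            "auto.offset.reset": "smallest",
        },
    )

On the Kafka side, the corresponding knob is the retention setting itself (for example log.retention.hours on the broker), which has to stay longer than the worst-case processing lag.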

Source: https://habr.com/ru/post/1678487/

