KStreams + Spark Streaming + Machine Learning

I am doing a POC to run a machine learning algorithm on a data stream.
My initial idea was to take the data and use

Spark Streaming -> aggregate data from several tables -> run MLlib on the data stream -> output.

But then I came across KStreams. Now I am confused!

Questions:
1. What is the difference between Spark Streaming and Kafka Streaming?
2. How can I marry KStreams + Spark Streaming + Machine Learning?
3. My idea is to train the model continuously on incoming data rather than retrain it periodically.

+7
4 answers

First of all, the term “Confluent's Kafka Streaming” is technically incorrect.

  • it is called Kafka's Streams API (aka Kafka Streams)
  • it is part of Apache Kafka and thus “owned” by the Apache Software Foundation (not by Confluent)
  • Confluent Open Source and Confluent Enterprise are two offerings from Confluent that both build on Apache Kafka (and thus Kafka Streams)

However, Confluent contributes a lot of code to Apache Kafka, including Kafka Streams.

About the differences (I only highlight some of the main ones and refer to the documentation for further details: http://docs.confluent.io/current/streams/index.html and http://spark.apache.org/streaming/ )

Spark Streaming:

  • micro-batching (no true record-by-record stream processing)
  • no sub-second latency
  • limited window operations
  • no event-time processing
  • processing framework (difficult to operate and deploy)
  • part of Apache Spark, a data processing framework
  • exactly-once processing
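
To illustrate why micro-batching puts a floor on latency, here is a minimal sketch in plain Python (no Spark involved). It assumes a simplified model in which a record's result becomes available only when its batch interval closes and processing itself takes no time; the function names and the 1000 ms interval are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def micro_batches(records, batch_interval_ms):
    """Group (arrival_ms, value) records into fixed intervals by arrival time."""
    batches = defaultdict(list)
    for arrival_ms, value in records:
        batches[arrival_ms // batch_interval_ms].append((arrival_ms, value))
    # Return the batches in time order.
    return [batches[index] for index in sorted(batches)]


def micro_batch_latencies(records, batch_interval_ms):
    """For each record, the wait until the end of the batch it belongs to."""
    latencies = []
    for arrival_ms, _value in records:
        batch_end_ms = (arrival_ms // batch_interval_ms + 1) * batch_interval_ms
        latencies.append(batch_end_ms - arrival_ms)
    return latencies


records = [(0, "a"), (250, "b"), (999, "c"), (1000, "d")]
print(micro_batches(records, 1000))
print(micro_batch_latencies(records, 1000))
```

In this model a record that arrives right after a batch boundary waits almost a full interval, whereas a record-by-record processor could handle it on arrival.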

Kafka Streams:

  • record-by-record stream processing
  • millisecond latency
  • rich window operations
  • stream/table duality
  • event-time, ingestion-time, and processing-time semantics
  • Java library (easy to run and deploy; it is just a Java application like any other)
  • part of Apache Kafka, a stream processing platform (i.e., it offers storage and processing at once)
  • at-least-once processing (exactly-once processing is WIP; cf. KIP-98 and KIP-129)
  • elastic, i.e., dynamically scalable
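
As a rough illustration of what event-time windowing means, here is a plain-Python sketch (not the Kafka Streams API) that counts events per key in tumbling windows. The window is chosen from the timestamp carried by the event, not from the order in which events arrive. The function name, the sample events, and the 60-second window are assumptions for illustration; late-record handling such as grace periods is left out.

```python
from collections import defaultdict


def tumbling_window_counts(events, window_ms):
    """Count events per (window_start, key), bucketing by event time."""
    counts = defaultdict(int)
    for event_time_ms, key in events:
        # Align the event timestamp to the start of its tumbling window.
        window_start = event_time_ms - (event_time_ms % window_ms)
        counts[(window_start, key)] += 1
    return dict(counts)


# The third event arrives after the second one but carries an earlier
# timestamp, so it still lands in the first window.
events = [
    (1_000, "user-1"),
    (61_000, "user-1"),
    (59_000, "user-1"),
    (2_000, "user-2"),
]
window_counts = tumbling_window_counts(events, 60_000)
print(window_counts)
```

With processing-time semantics the out-of-order event would instead be counted in whichever window was open when it arrived.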

Thus, there is no reason to “marry” the two; it is a matter of choosing which one you want to use.

My personal take is that Spark is not a good solution for stream processing. Whether you want to use a library like Kafka Streams or a framework like Apache Flink, Apache Storm, or Apache Apex (all of which are good options for stream processing) depends on your use case (and possibly personal taste) and cannot be answered on SO.

A main differentiator of Kafka Streams is that it is a library and does not require a processing cluster. Since it is part of Apache Kafka, if you already run Apache Kafka this can simplify your overall deployment, because you do not need to operate an additional processing cluster.

+19

Apache Kafka Streams is a library that provides an embeddable stream processing engine. It is easy to use in Java applications for stream processing, and it is not a framework.

I found some use cases describing when to use Kafka Streams, and also a good comparison with Apache Flink from a Kafka author.

+3

I recently presented this topic at a conference.

Apache Kafka Streams or Spark Streaming are typically used to apply a machine learning model in real time to new events via stream processing (processing data while it is in motion). Matthias's answer already discusses their differences.

On the other hand, you first use something like Apache Spark MLlib (or H2O.ai or XYZ) to build the analytic models from historical datasets.
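
The split described here (build the model on historical data, then apply it to each new event) can be sketched in plain Python. This is only an illustration under stated assumptions: a one-feature least-squares line stands in for whatever MLlib or H2O.ai would produce, and `fit_line`, `score_stream`, and the sample data are hypothetical names and values.

```python
def fit_line(history):
    """Fit y = slope * x + intercept by ordinary least squares (offline step)."""
    n = len(history)
    mean_x = sum(x for x, _ in history) / n
    mean_y = sum(y for _, y in history) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in history)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in history)
    slope = cov_xy / var_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


def score_stream(events, model):
    """Apply the already-trained model to each event as it arrives."""
    slope, intercept = model
    for x in events:
        yield x, slope * x + intercept


# Offline: train once on historical data.
history = [(1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
model = fit_line(history)

# Online: score new events one at a time; the model itself does not change.
scored = list(score_stream(iter([4.0, 10.0]), model))
print(model, scored)
```

In a real deployment the scoring loop would live inside the stream processing application, and the model would be swapped out whenever a new one is trained.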

Kafka Streams can also be used for online training of models, although I think online training comes with various caveats.
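
For a sense of what online training looks like, here is a minimal plain-Python sketch, assuming a one-feature linear model updated by stochastic gradient descent on every incoming event. The class name, learning rate, and synthetic stream are hypothetical; it does not use Kafka Streams or MLlib.

```python
class OnlineLinearModel:
    """Linear model y = weight * x + bias, updated one event at a time."""

    def __init__(self, learning_rate=0.05):
        self.weight = 0.0
        self.bias = 0.0
        self.learning_rate = learning_rate

    def predict(self, x):
        return self.weight * x + self.bias

    def update(self, x, y):
        """One SGD step on the squared error of a single (x, y) event."""
        error = self.predict(x) - y
        self.weight -= self.learning_rate * error * x
        self.bias -= self.learning_rate * error
        return error


online_model = OnlineLinearModel()
# Synthetic, noise-free stream following y = 2x + 1 (an assumption for the demo).
stream = [(x, 2.0 * x + 1.0) for x in (0.0, 1.0, 2.0, 3.0)] * 500
for x, y in stream:
    online_model.update(x, y)

print(online_model.weight, online_model.bias)
```

On this clean synthetic stream the parameters approach 2 and 1. Real streams are noisy and can drift, which is part of why a fixed learning rate and unvalidated per-event updates need care.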

All of this is discussed in more detail in my slide deck, "Apache Kafka Streams and Machine Learning / Deep Learning for Real-Time Stream Processing".

+2

Spark Streaming and KStreams in one picture, from the point of view of stream processing:

[Image: Spark and KStreams]

To keep the answer short, I highlight only the significant advantages of Spark Streaming and KStreams below.

Advantages of Spark Streaming over KStreams:

  1. Easy integration of Spark ML models and graph computation in the same application without writing data outside the application, which means processing is much faster than writing to Kafka and processing again.
  2. Join non-streaming sources, such as the file system and other non-Kafka sources, with streaming sources in the same application.
  3. Messages with a schema can be easily processed with SQL (Structured Streaming).
  4. Graph analysis of streaming data is possible using the built-in GraphX library.
  5. Spark applications can be deployed on an existing YARN or Mesos cluster, if you have one.
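
Combining a stream with a non-streaming source, as mentioned in the list above, amounts to enriching each event with a lookup against static data. A minimal plain-Python sketch of that idea follows; it is not the Spark API, and `enrich`, the `user_id` field, and the sample reference data are hypothetical.

```python
def enrich(stream, reference):
    """Inner-join each streaming event with static reference data by user_id."""
    for event in stream:
        profile = reference.get(event["user_id"])
        if profile is not None:
            # Merge the static attributes into the event.
            yield {**event, **profile}


# Static data, e.g. loaded once from a file (contents are made up).
reference = {
    "u1": {"country": "DE"},
    "u2": {"country": "US"},
}

# Streaming events; "u3" has no reference row and is dropped by the inner join.
clicks = [
    {"user_id": "u1", "page": "/home"},
    {"user_id": "u3", "page": "/cart"},
    {"user_id": "u2", "page": "/docs"},
]
enriched = list(enrich(iter(clicks), reference))
print(enriched)
```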

Advantages of KStreams:

  1. A compact library for ETL processing and for ML model serving / training on messages, with rich features. So far, both the source and the target have to be Kafka topics.
  2. Easy to achieve exactly-once semantics.
  3. A separate processing cluster is not required.
  4. Easy to deploy in Docker, since it is a plain Java application.
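
The delivery guarantees mentioned in this thread (at-least-once versus exactly-once) can be illustrated with a small plain-Python sketch. Under at-least-once delivery a record may be delivered twice, so the sketch skips offsets it has already applied. This is only an illustration of the effect, not how Kafka implements exactly-once processing; the function name and the sample offsets are hypothetical.

```python
def process_idempotently(deliveries, state, seen_offsets):
    """Apply each (offset, amount) once, ignoring redelivered offsets."""
    for offset, amount in deliveries:
        if offset in seen_offsets:
            # Redelivery after a retry: already applied, so skip it.
            continue
        seen_offsets.add(offset)
        state["total"] = state.get("total", 0) + amount
    return state


# Offset 1 is delivered twice, as can happen with at-least-once delivery.
deliveries = [(0, 10), (1, 5), (1, 5), (2, 7)]
totals = process_idempotently(deliveries, {}, set())
print(totals)
```

Without the offset check the duplicate would be counted twice, which is the visible difference between the two guarantees.
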
+2

Source: https://habr.com/ru/post/1261223/

