First of all, the term “Confluent Kafka Streams” is technically incorrect.
- it is called Kafka's Streams API (aka Kafka Streams)
- it is part of Apache Kafka and thus “owned” by the Apache Software Foundation (not by Confluent)
- Confluent Open Source and Confluent Enterprise are two offerings from Confluent, both of which build on Apache Kafka (and thus on Kafka Streams)
However, Confluent contributes a lot of code to Apache Kafka, including Kafka Streams.
About the differences (I only highlight some of the main ones and refer you to the Internet and the documentation for further details: http://docs.confluent.io/current/streams/index.html and http://spark.apache.org/streaming/ )
Spark Streaming:
- micro-batching (no true record-by-record stream processing)
- no sub-second latency
- limited window operations
- no event-time processing
- processing framework (difficult to operate and deploy)
- part of Apache Spark, a data processing framework
- exactly-once processing
Kafka Streams:
- record-by-record stream processing
- millisecond latency
- rich window operations
- stream / table duality
- event-time, ingestion-time and processing-time semantics
- Java library (easy to run and deploy: it's just a Java application like any other)
- part of Apache Kafka, a stream processing platform (i.e. it offers storage and processing at once)
- at-least-once processing (exactly-once processing is a work in progress; cf. KIP-98 and KIP-129)
- elastic, i.e. dynamically scalable
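
To illustrate why the choice of time semantics in the list above matters, here is a minimal plain-Java sketch. It does not use the Kafka Streams or Spark APIs; the class `TimeSemanticsSketch`, the `Event` type and both helper methods are hypothetical names made up for this example. It counts records per tumbling window, once by event time (when the event happened) and once by arrival time (a stand-in for processing time), assuming non-negative millisecond timestamps.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class TimeSemanticsSketch {

    /** Hypothetical record type: when the event happened vs. when it reached the processor. */
    public static final class Event {
        public final String key;
        public final long eventTimeMs;
        public final long arrivalTimeMs;

        public Event(String key, long eventTimeMs, long arrivalTimeMs) {
            this.key = key;
            this.eventTimeMs = eventTimeMs;
            this.arrivalTimeMs = arrivalTimeMs;
        }
    }

    /** Start of the tumbling window that contains the timestamp (assumes timestampMs >= 0). */
    public static long windowStart(long timestampMs, long windowSizeMs) {
        return timestampMs - (timestampMs % windowSizeMs);
    }

    /** Counts events per tumbling window, keyed by window start, using the chosen timestamp. */
    public static Map<Long, Integer> countPerWindow(List<Event> events,
                                                    long windowSizeMs,
                                                    boolean useEventTime) {
        Map<Long, Integer> counts = new TreeMap<>();
        for (Event e : events) {
            long ts = useEventTime ? e.eventTimeMs : e.arrivalTimeMs;
            counts.merge(windowStart(ts, windowSizeMs), 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        List<Event> events = Arrays.asList(
            new Event("a", 1_000L, 1_100L),
            new Event("b", 4_000L, 4_200L),
            new Event("c", 2_000L, 6_500L)  // happened early, arrived late
        );
        // Event time: the late record still lands in the window where it happened.
        System.out.println(countPerWindow(events, 5_000L, true));   // {0=3}
        // Arrival time: the late record is counted in a later window.
        System.out.println(countPerWindow(events, 5_000L, false));  // {0=2, 5000=1}
    }
}
```

With event-time semantics the result depends only on the data, not on delivery delays, which is why support for it is listed as a differentiator here.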
Thus, there is no reason to “marry” the two; it is a matter of choice which one you want to use.
My personal take is that Spark is not a good solution for stream processing. Whether you want to use a library like Kafka Streams or a framework like Apache Flink, Apache Storm or Apache Apex (which are all good options for stream processing) depends on your use case (and possibly personal taste) and cannot be answered on SO.
A main differentiator of Kafka Streams is that it is a library and does not require a processing cluster. And because it is part of Apache Kafka, if you already have Apache Kafka in place, this can simplify your overall deployment, since you do not need to operate an additional processing cluster.