Spark Streaming + Kafka vs Just Kafka

Why and when did you decide to use Spark streaming with Kafka?

Suppose I have a system that receives a thousand messages per second through Kafka. I need to apply some real-time analytics to these messages and save the results to a database.

I have two options:

  • Write my own worker that reads messages from Kafka, runs the analytics algorithm, and saves the result to the database. In the Docker era it is easy to scale this worker across my entire cluster with the scale command. I just need to make sure that I have at least as many partitions as workers, and everything is fine: I have real concurrency. (A minimal sketch of such a worker is shown right after this list.)

  • Create a Spark cluster with Kafka streaming input, let the Spark cluster perform the analytics, and then save the result. (A sketch of this option follows the question below.)
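
For the first option, a minimal sketch of such a worker using the plain Java Kafka client could look roughly like this. The broker address, topic name, consumer group, and the `runAnalytics` / `saveToDatabase` methods are placeholders for illustration, not part of any particular library:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class AnalyticsWorker {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "kafka:9092");   // placeholder broker address
        props.put("group.id", "analytics-workers");     // all worker replicas share one group
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("events")); // placeholder topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    String result = runAnalytics(record.value()); // your analytics algorithm
                    saveToDatabase(result);                       // your database write
                }
            }
        }
    }

    private static String runAnalytics(String message) { /* ... */ return message; }
    private static void saveToDatabase(String result) { /* ... */ }
}
```

Because all replicas share one consumer group, Kafka assigns each partition to exactly one worker, which is why the number of partitions has to be at least the number of workers you scale to.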

Is there a case where the second option is the best choice? It seems to me that this is just additional overhead.
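
For comparison, a minimal sketch of what the second option could look like with Spark Structured Streaming. The broker address, topic name, and the micro-batch writing logic are placeholders, and it assumes the spark-sql-kafka-0-10 connector is on the classpath:

```java
import org.apache.spark.api.java.function.VoidFunction2;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class SparkAnalyticsJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-analytics")
                .getOrCreate();

        // Read the Kafka topic as an unbounded streaming DataFrame.
        Dataset<Row> messages = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "kafka:9092") // placeholder broker
                .option("subscribe", "events")                   // placeholder topic
                .load()
                .selectExpr("CAST(value AS STRING) AS message");

        // Apply the analytics and write each micro-batch to the database.
        StreamingQuery query = messages.writeStream()
                .foreachBatch((VoidFunction2<Dataset<Row>, Long>) (batch, batchId) -> {
                    // run aggregations on the micro-batch here and save the result,
                    // e.g. via batch.write().jdbc(...) or a custom sink
                })
                .start();

        query.awaitTermination();
    }
}
```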

+10
1 answer

In the Docker era, it's easy to scale this worker across my entire cluster.

If you already have access to this infrastructure, then use it. Bundle your Kafka libraries in some kind of minimal container with health checks and whatnot, and for the most part this works great. Adding a Kafka client dependency plus a database dependency is all you really need, right?

If you are not using Spark, Flink, etc., you will need to handle Kafka errors, retries, offset management, and processing more closely in your own code, rather than letting the framework handle them for you.
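
For example, to get at-least-once delivery into the database you would typically disable auto-commit and only commit offsets after the write succeeds. A sketch under those assumptions, replacing the poll loop of the worker above (retry and error handling deliberately kept trivial):

```java
props.put("enable.auto.commit", "false");  // we commit offsets ourselves

while (true) {
    ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
    try {
        for (ConsumerRecord<String, String> record : records) {
            saveToDatabase(runAnalytics(record.value()));
        }
        consumer.commitSync();  // mark the batch as processed only after the DB writes succeed
    } catch (Exception e) {
        // no commit: the batch is re-delivered after this consumer recovers
        // or after another worker in the group takes over the partition
        System.err.println("processing failed, batch will be retried: " + e);
    }
}
```

This is exactly the kind of bookkeeping that Spark, Flink, or Kafka Streams would otherwise do for you.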

I will add that if you want Kafka + database interaction, check out the Kafka Connect API. Connectors already exist for JDBC, Mongo, Couchbase, Cassandra, etc.
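
As a rough example, a JDBC sink for a results topic could be configured along these lines. The connector class is from Confluent's kafka-connect-jdbc plugin; the topic, connection URL, and credentials are placeholders you would adapt:

```json
{
  "name": "analytics-results-jdbc-sink",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
    "topics": "analytics-results",
    "connection.url": "jdbc:postgresql://db:5432/analytics",
    "connection.user": "app",
    "connection.password": "secret",
    "insert.mode": "insert",
    "pk.mode": "none",
    "auto.create": "true"
  }
}
```

Posted to the Connect REST API (POST /connectors), this persists the topic to the database without any custom consumer code.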

If you need more involved computation, I would go for Kafka Streams rather than maintaining a separate Spark cluster, and stay with "just Kafka".
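
A minimal Kafka Streams sketch under the same assumptions (topic names and the analytics step are placeholders); the output topic could then be pushed to the database by the Kafka Connect sink mentioned above:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;

public class AnalyticsStreamsApp {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "analytics-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "kafka:9092"); // placeholder broker
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();
        KStream<String, String> messages = builder.stream("events");   // input topic (placeholder)
        messages.mapValues(AnalyticsStreamsApp::runAnalytics)          // per-message analytics
                .to("analytics-results");                              // output topic (placeholder)

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Runtime.getRuntime().addShutdownHook(new Thread(streams::close));
    }

    private static String runAnalytics(String message) { /* ... */ return message; }
}
```

Scaling it is the same story as the plain consumer: run more instances of the same application id, one partition per instance at most.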

Create a Spark cluster

Suppose you do not want to maintain that, or rather, that you cannot pick between YARN, Mesos, Kubernetes, or Standalone. And if you are running any of the first three, it might be worth looking at running Docker on top of them anyway.

You are absolutely right that this is extra overhead, so I find it entirely depends on what you already have (for example, an existing Hadoop / YARN cluster with spare resources), or what you are willing to maintain in-house (or pay a vendor for, for example, Kafka and Databricks in some hosted solution).

In addition, Spark does not ship with the latest Kafka client library (up until Spark 2.4.0, which upgraded to the Kafka 2.0 client, I believe), so you will need to decide whether that matters to you.

For truly streaming libraries, rather than Spark's micro-batches, Apache Beam or Flink would probably let you run the same types of workloads against Kafka.


In general, in order to scale out producers / consumers, you need some kind of resource scheduler. Installing Spark may not be hard for some, but knowing how to use it efficiently and tune it for the appropriate resources can be.

+1

Source: https://habr.com/ru/post/1270115/

