Spark creating too many partitions

I have a 3-node Cassandra cluster with 1 seed node, plus 1 Spark master and 3 slave nodes, each with 8 GB of RAM and 2 cores. Here is the configuration for my Spark jobs:

spark.cassandra.input.split.size_in_mb 67108864 
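For reference, a minimal sketch of setting the same property programmatically (assuming Spark 1.x; the app name and host are placeholders). Note the _in_mb suffix: the value is interpreted in MB, so 67108864 here would be read as 67,108,864 MB, while 64 MB expressed in bytes happens to be exactly 67108864.

import org.apache.spark.{SparkConf, SparkContext}

// Sketch: spark.cassandra.input.split.size_in_mb takes a value in MB.
val conf = new SparkConf()
  .setAppName("cassandra-read")                         // placeholder app name
  .set("spark.cassandra.connection.host", "127.0.0.1")  // placeholder host
  .set("spark.cassandra.input.split.size_in_mb", "64")  // 64 MB per split
val sc = new SparkContext(conf)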

With this configuration, about 89.1 MB of data (approximately 1,706,765 records) ends up spread across roughly 768 partitions. I cannot understand why so many partitions are being created. I am using Cassandra Spark connector version 1.4, so the known bug around the input split size is also fixed.

There are only 11 unique partition keys. My partition key consists of an application name, which is always "test", and a random number, which is always between 0 and 10, so there are only 11 distinct partitions.

Why are there so many partitions, and how does Spark decide how many partitions to create?

1 answer

The Cassandra connector does not use defaultParallelism. It checks a system table in C* (post 2.1.5) to estimate how many MB of data are in the table. That value is read and divided by the input split size to determine the number of splits to create.
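As a worked illustration of that division (a sketch, not the connector's actual code; the table size estimate below is hypothetical, chosen because 768 × 64 MB = 49152 MB, and the 64 MB default split size is an assumption):

// numSplits ≈ estimated table size in MB / split size in MB
val estimatedTableSizeInMB = 49152  // hypothetical estimate from the C* system table
val splitSizeInMB = 64              // assumed default split size in MB
val numSplits = estimatedTableSizeInMB / splitSizeInMB  // = 768, as in the question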

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size

If you are on C* < 2.1.5, you will need to set the split count manually using ReadConf.
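A minimal sketch of doing that (assuming an existing SparkContext sc; the keyspace, table name, and split count are placeholders):

import com.datastax.spark.connector._
import com.datastax.spark.connector.rdd.ReadConf

// Bypass the automatic size estimate and force a fixed number of
// Spark partitions for this table scan.
val rdd = sc.cassandraTable("my_keyspace", "my_table")
  .withReadConf(ReadConf(splitCount = Some(24)))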

