Excessive partitioning (too many tasks) on an Apache Spark / Cassandra cluster

I am running a fairly simple application on a Spark / Cassandra cluster. Since migrating to a new environment (Spark 1.5 instead of 1.2, plus a minor Cassandra version upgrade), there has been a significant drop in performance (from 4 to 1-5 m per job, for the same amount of data).

After an initial investigation, it appears that for exactly the same code, Spark now generates far more tasks (20k+, where there used to be up to 5), and the executor-side logs reflect the same picture:

many sequential executions of the same query against different partitions:

 ...
 CassandraTableScanRDD: Fetched 0 rows from xy for partition 20324 in 0.138 s.
 CassandraTableScanRDD: Fetched 0 rows from xy for partition 20327 in 0.058 s.
 CassandraTableScanRDD: Fetched 0 rows from xy for partition 20329 in 0.053 s.
 ...

whereas previously there was just a single one:

 CassandraTableScanRDD: Fetched 905 rows from xy for partition 0 in 2.992 s. 

Since the application code is the same, what could cause such a difference in partitioning behavior, and what can be done to fix it?

NB! The two environments were set up independently; the configuration is not shared or inherited between them.

Thanks.

1 answer

The newer version of the Spark Cassandra Connector uses a system table (system.size_estimates), available in more recent Cassandra versions, to estimate split sizes. That table is currently refreshed every 5 minutes, although the number of partitions you are seeing is extremely large. The value read from this table is divided by your split size to determine the number of Spark partitions.
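For illustration, here is a minimal sketch of how the split size can be raised so that the estimated table size divides into fewer partitions. The property name spark.cassandra.input.split.size_in_mb comes from the connector documentation linked below; the host, keyspace, table name, and the 256 MB value are placeholders, and the 64 MB default should be verified against your connector version:

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    // Sketch: raise the split size so fewer Spark partitions (tasks) are created.
    // The connector divides the estimated table size by this value to get the
    // partition count, so a larger split size means fewer, bigger partitions.
    val conf = new SparkConf()
      .setAppName("cassandra-scan")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // assumption: adjust to your cluster
      .set("spark.cassandra.input.split.size_in_mb", "256")  // default is 64 MB in recent connector versions

    val sc = new SparkContext(conf)

    // "ks" and "xy" are placeholders for your keyspace and table.
    val rows = sc.cassandraTable("ks", "xy")
    println(rows.partitions.length)  // check how many Spark partitions were created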

If you are using a C* version older than 2.1.5, this table does not exist and the partitioning will need to be specified manually.

https://github.com/datastax/spark-cassandra-connector/blob/master/doc/FAQ.md#what-does-inputsplitsize_in_mb-use-to-determine-size

You can manually pass in the split count using ReadConf if you continue to see problems.
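A minimal sketch of overriding the split count for a single table scan, assuming a SparkContext sc is already set up as above; ReadConf and withReadConf are part of the connector API, while the keyspace/table names and the count of 5 are placeholders matching the task count the question describes:

    import com.datastax.spark.connector._
    import com.datastax.spark.connector.rdd.ReadConf

    // Sketch: force a fixed number of Spark partitions for this scan instead of
    // relying on the size estimate read from the Cassandra system table.
    val rdd = sc.cassandraTable("ks", "xy")            // "ks" / "xy" are placeholders
      .withReadConf(ReadConf(splitCount = Some(5)))    // roughly 5 tasks, as in the old setup

    rdd.count()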

