I am running a fairly simple application against a Spark/Cassandra cluster. Since moving to a new environment (Spark 1.5 instead of 1.2, plus a minor Cassandra version update), there has been a significant drop in performance (from about 4 seconds to 1-5 minutes per job, for the same amount of data).
After an initial investigation it appears that, for exactly the same code, far more tasks are now created (20k+, where previously there were at most 5), and the executor logs show the same picture:
many consecutive executions of the same query against different partitions:
...
CassandraTableScanRDD: Fetched 0 rows from xy for partition 20324 in 0.138 s.
CassandraTableScanRDD: Fetched 0 rows from xy for partition 20327 in 0.058 s.
CassandraTableScanRDD: Fetched 0 rows from xy for partition 20329 in 0.053 s.
...
whereas previously there was a single one:
CassandraTableScanRDD: Fetched 905 rows from xy for partition 0 in 2.992 s.
Since the application code is the same, what could cause such a difference in partitioning behavior, and what can be done to fix it?
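To rule out log noise, the partition count can also be checked directly on the driver. A minimal sketch, assuming the DataStax Spark Cassandra connector API (the host, keyspace, and app name below are placeholders; "xy" is the table from the logs):

    import org.apache.spark.{SparkConf, SparkContext}
    import com.datastax.spark.connector._

    val conf = new SparkConf()
      .setAppName("partition-count-check")
      .set("spark.cassandra.connection.host", "127.0.0.1") // placeholder host

    val sc = new SparkContext(conf)

    // cassandraTable returns a CassandraTableScanRDD; each of its Spark
    // partitions corresponds to one of the token-range scans seen in the
    // executor log above.
    val rdd = sc.cassandraTable("my_keyspace", "xy")
    println(s"Spark partitions for the scan: ${rdd.partitions.length}")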
NB! The two environments were set up independently: even where their setups differ, no configuration is shared or inherited between them.
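Because the configurations are independent, one setting worth comparing between the two environments is the connector's input split size. A sketch, assuming the 1.5-era connector, where splits are sized in megabytes via spark.cassandra.input.split.size_in_mb, while the 1.2-era connector sized splits by row count via spark.cassandra.input.split.size, so a value carried over verbatim would not mean the same thing:

    import org.apache.spark.SparkConf

    // Pin the split size explicitly so the table is divided the same way
    // in both environments. 64 MB is the connector's documented default
    // for this setting (verify for your exact version); tune as needed.
    val conf = new SparkConf()
      .setAppName("explicit-split-size")
      .set("spark.cassandra.connection.host", "127.0.0.1")   // placeholder
      .set("spark.cassandra.input.split.size_in_mb", "64")   // 1.5-era setting

The same value can also be passed at submit time with --conf spark.cassandra.input.split.size_in_mb=64, which makes it easy to diff against the old environment's launch scripts.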
Thanks.