Spark: PySpark + Cassandra query performance

I have set up Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16 GB of RAM) for testing purposes and edited spark-defaults.conf as follows:

spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4

Then I imported 1.5 million rows into Cassandra, into a table with the following schema:

CREATE TABLE test (
    tid int,
    cid int,
    pid int,
    ev list<double>,
    PRIMARY KEY (tid)
);

test.ev is a list containing numeric values, e.g. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
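The post does not show how the data was loaded; for reproducing the setup, here is a minimal sketch using the DataStax cassandra-driver package (the keyspace name testks matches the load() call below, and the random values are placeholders, not the real data set):

from random import randint
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])   # assumption: Cassandra runs locally
session = cluster.connect()
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS testks
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS testks.test (
        tid int, cid int, pid int, ev list<double>, PRIMARY KEY (tid)
    )
""")

insert = session.prepare("INSERT INTO testks.test (tid, cid, pid, ev) VALUES (?, ?, ?, ?)")
for tid in range(1500000):
    ev = [float(randint(0, 2500)) for _ in range(15)]        # 15 values, as in the sample row
    session.execute(insert, (tid, tid % 10, tid % 100, ev))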

Now in my code, to check all of the above, I just created a SparkSession, connected to Cassandra and ran a simple count:

cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()

At this point Spark outputs the count and takes about 28 seconds to complete the Job, distributed across 13 Tasks (in the Spark UI, the total Input for the tasks is 331.6 MB).
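For completeness, the full read path looks roughly like this; a minimal sketch, assuming the spark-cassandra-connector package is on the classpath and Cassandra listens on localhost (the appName and connection host are illustrative):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pyspark-cassandra-test")                        # illustrative name
         .config("spark.cassandra.connection.host", "127.0.0.1")   # assumption: local node
         .getOrCreate())

df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="testks", table="test")
      .load())

print(df.count())                  # the ~28 second job described above
print(df.rdd.getNumPartitions())   # matches the 13 tasks shown in the Spark UI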

Questions:

  • Is this expected performance? If not, what am I missing?
  • If this is acceptable performance for my setup, I would expect Spark to create only a few partitions. Since I set spark.sql.shuffle.partitions to 4, why are there 13 Tasks? (I also confirmed the partition count by calling rdd.getNumPartitions() on my DataFrame.)

Update: a typical operation I would like to test on this data:

  • Query a large data set, say 100,000 ~ N rows, grouped by pid
  • Select ev, a list<double>
  • Average each element, assuming by now every list has the same length, e.g. df.groupBy('pid').agg(avg(df['ev'][1])) (see the sketch after this list)
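A sketch of that aggregation, assuming every ev list has the same length (15, as in the sample row above; the column aliases are made up for illustration):

from pyspark.sql.functions import avg, col

ev_len = 15   # assumption: all lists share this length
per_pid_avgs = df.groupBy("pid").agg(
    *[avg(col("ev")[i]).alias("ev_%d_avg" % i) for i in range(ev_len)]
)
per_pid_avgs.show(5)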

Following @zero323's suggestion, I deployed a separate machine (2 GB RAM, 4 cores, SSD) with Cassandra and loaded the same data set. The df.select().count() showed the expected higher latency and overall poorer performance compared with my previous test (it took about 70 seconds to complete the Job).

Edit: I misunderstood his suggestion. @zero323 meant letting Cassandra perform the count itself instead of counting with Spark SQL.

I also want to point out that I am aware of the inherent anti-pattern of storing this kind of data as a list<double> instead of a wide row, but right now my concern is the time spent retrieving a large data set rather than the actual time to compute the average.


Answer:

Is this expected performance? If not, what am I missing?

It looks slowish, but it is not exactly unexpected. In general, count is expressed as

SELECT 1 FROM table

followed by summation on the Spark side. So, while it is optimized, it is still rather inefficient, because you have to fetch N long integers from the external source only to sum them locally.

As explained in the docs, Cassandra-backed RDDs (not Datasets) provide an optimized cassandraCount method which performs server-side counting.
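cassandraCount belongs to the connector's Scala/Java RDD API. From Python, one way to push the count down to Cassandra is to issue it through the DataStax Python driver instead; a minimal sketch, assuming a local node (note that COUNT(*) over a large table can still hit the server-side read timeout):

from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])        # assumption: local Cassandra node
session = cluster.connect("testks")

# Server-side count: no rows are shipped to Spark, Cassandra returns a single value
count_row = session.execute("SELECT COUNT(*) FROM test").one()
print(count_row[0])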

If this is acceptable performance for my setup, I would expect Spark to create only a few partitions. Since I set spark.sql.shuffle.partitions to (...), why (...)?

Because spark.sql.shuffle.partitions is not used here. This property determines the number of partitions for shuffles (when data is aggregated by some set of keys), not for Dataset creation or for global aggregations like count(*) (which always uses 1 partition for the final aggregation).
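A quick way to see the distinction, reusing the df from the question (a sketch; the printed numbers are the ones reported in the question and depend on the data and configuration):

print(df.rdd.getNumPartitions())       # input partitions, driven by the Cassandra splits (13 here)

grouped = df.groupBy("pid").count()
print(grouped.rdd.getNumPartitions())  # after a shuffle, follows spark.sql.shuffle.partitions (4 here)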

If you are interested in controlling the number of initial partitions, take a look at spark.cassandra.input.split.size_in_mb, which defines:

The approximate amount of data to be fetched into a Spark partition. The minimum number of resulting Spark partitions is 1 + 2 * SparkContext.defaultParallelism

As you can see, another factor here is spark.default.parallelism, but it is not exactly a subtle configuration, so depending on it is in general not an optimal choice.
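To actually influence the number of input partitions, the split size can be raised when the session is built; a minimal sketch (the 128 MB value is only an example, and the 1 + 2 * SparkContext.defaultParallelism floor quoted above still applies):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("pyspark-cassandra-test")
         .config("spark.cassandra.connection.host", "127.0.0.1")
         # example value: with ~331.6 MB of input, larger splits mean fewer partitions
         .config("spark.cassandra.input.split.size_in_mb", "128")
         .getOrCreate())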


Source: https://habr.com/ru/post/1655086/
