I have set up Spark 2.0 and Cassandra 3.0 on a local machine (8 cores, 16 GB of RAM) for testing purposes and edited spark-defaults.conf as follows:
spark.python.worker.memory 1g
spark.executor.cores 4
spark.executor.instances 4
spark.sql.shuffle.partitions 4
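For completeness, a minimal sketch of how a session with these settings can be created in PySpark (the connector package coordinates and the Cassandra host below are assumptions, not part of my actual config):

from pyspark.sql import SparkSession

# Sketch only: connector coordinates and host are assumptions.
spark = (SparkSession.builder
         .appName("cassandra-test")
         .config("spark.python.worker.memory", "1g")
         .config("spark.executor.cores", "4")
         .config("spark.executor.instances", "4")
         .config("spark.sql.shuffle.partitions", "4")
         .config("spark.jars.packages",
                 "com.datastax.spark:spark-cassandra-connector_2.11:2.0.0")  # assumed version
         .config("spark.cassandra.connection.host", "127.0.0.1")             # assumed host
         .getOrCreate())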
Then I imported 1.5 million rows into Cassandra:
CREATE TABLE testks.test (
    tid int,
    cid int,
    pid int,
    ev list<double>,
    PRIMARY KEY (tid)
);
test.ev is a list containing numerical values, e.g. [2240,2081,159,304,1189,1125,1779,693,2187,1738,546,496,382,1761,680]
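For context, a simplified sketch of one way such rows can be inserted with the Python cassandra-driver (the generated values below are illustrative, not my actual data):

import random
from cassandra.cluster import Cluster

# Illustrative sketch: insert 1.5 million rows with a list<double> column.
cluster = Cluster(['127.0.0.1'])          # assumed local node
session = cluster.connect('testks')

insert = session.prepare(
    "INSERT INTO test (tid, cid, pid, ev) VALUES (?, ?, ?, ?)")

for tid in range(1500000):
    ev = [float(random.randint(0, 2500)) for _ in range(15)]  # made-up values
    session.execute(insert, (tid, random.randint(0, 10),
                             random.randint(0, 100), ev))

cluster.shutdown()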
Now, to check that everything works, I created a SparkSession, connected to Cassandra, and ran a simple count:
cassandra = spark.read.format("org.apache.spark.sql.cassandra")
df = cassandra.load(keyspace="testks",table="test")
df.select().count()
At this point, Spark outputs the count and takes about 28 seconds to complete the Job, distributed across 13 Tasks (in the Spark UI, the total Input for the tasks is 331.6 MB).
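For reference, the same read written with the connector's explicit options (equivalent to the load() call above, as far as I understand the DataFrame reader API):

# Equivalent, more explicit form of the read above.
df = (spark.read
      .format("org.apache.spark.sql.cassandra")
      .options(keyspace="testks", table="test")
      .load())
print(df.count())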
Questions:
- Is this expected performance? If not, what am I missing?
- Why 13 tasks? I set spark.sql.shuffle.partitions to 4, so how does Spark decide how to partition the DataFrame it reads from Cassandra? (I also confirmed the number of partitions by calling rdd.getNumPartitions() on the DataFrame; see the snippet after this list.)
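This is what I checked while looking into the partition count. The pointer to spark.cassandra.input.split.size_in_mb is my assumption about what controls the scan splits, based on the connector documentation; spark.sql.shuffle.partitions should only affect shuffle stages:

# How many partitions did the Cassandra scan actually produce?
print(df.rdd.getNumPartitions())   # 13 in my case

# spark.sql.shuffle.partitions only applies to shuffle stages; my assumption,
# from the connector documentation, is that the scan partitioning is driven by
# spark.cassandra.input.split.size_in_mb (set in spark-defaults.conf / SparkConf).
# The partition count can also be changed explicitly after the read:
df4 = df.repartition(4)
print(df4.rdd.getNumPartitions())  # 4, at the cost of a shuffle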
Update: a typical operation I would like to run over this dataset is the following:
- Query a large set of rows, say from 100,000 up to N, grouped by pid
- Take ev (a list<double>) and average each element of the list, assuming by that point that every list has the same length, e.g.:
df.groupBy('pid').agg(avg(df['ev'][1]))
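Spelled out a bit more, what I have in mind is something like this (the list length of 15 is an assumption taken from the sample row above):

from pyspark.sql.functions import avg

# Sketch: per-pid average of every element position in ev, assuming all
# lists have the same length (15, as in the sample row shown earlier).
ev_len = 15
aggs = [avg(df['ev'][i]).alias('ev_%d_avg' % i) for i in range(ev_len)]
per_pid = df.groupBy('pid').agg(*aggs)
per_pid.show()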
Following @zero323's suggestion, I deployed a separate machine (2 GB RAM, 4 cores, SSD) running Cassandra and loaded the same dataset onto it. The same df.select().count() was noticeably slower there (it took about 70 seconds to complete the Job).
Edit: I misunderstood that suggestion. @zero323 meant running the count natively in Cassandra rather than through Spark SQL.
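My understanding of that suggestion, as a sketch (assuming the Python cassandra-driver; note that a full-table COUNT(*) can itself time out on large tables):

from cassandra.cluster import Cluster

# Sketch of the suggested approach as I understand it: count in Cassandra
# itself instead of going through Spark SQL.
cluster = Cluster(['127.0.0.1'])      # assumed local node
session = cluster.connect('testks')
row = session.execute('SELECT COUNT(*) FROM test').one()
print(row.count)                      # full-table counts may time out on big tables
cluster.shutdown()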
It is also worth noting that I am aware that storing the values as a list<double> is something of an anti-pattern, but my concern at this point is the time spent retrieving a large dataset rather than the actual computation of the average.