How does the limit clause behave for Cassandra when using DataFrames?

I have a large Cassandra table. I want to fetch only 50 rows from Cassandra with the following code:

val ds = sparkSession.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> s"$Aggregates", "keyspace" -> s"$KeySpace"))
      .load()
      .where(col("aggregate_type") === "DAY")
      .where(col("start_time") <= "2018-03-28")
      .limit(50).collect()

This code pushes both where predicates down to Cassandra, but not the limit. Is it true that all the data (about 1 million records) is fetched? If not, why do this code and the same code without limit(50) run in approximately the same time?
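One way to see what actually reaches Cassandra is to inspect the physical plan before collecting. This is a sketch, assuming the same `sparkSession`, `$Aggregates` and `$KeySpace` as above:

```scala
// Build the query but don't collect; then inspect the plan.
// Pushed predicates appear under "PushedFilters", while limit(50)
// shows up only as a Spark-side CollectLimit/GlobalLimit node,
// i.e. it is not part of the Cassandra scan itself.
val ds = sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> s"$Aggregates", "keyspace" -> s"$KeySpace"))
  .load()
  .where(col("aggregate_type") === "DAY")
  .where(col("start_time") <= "2018-03-28")
  .limit(50)

ds.explain(true)  // look for: PushedFilters: [EqualTo(aggregate_type,DAY), ...]
```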

1 answer

Unlike Spark Streaming, Spark works on whole datasets rather than incremental batches, so it plans the read without knowing how many rows you will eventually keep. Two things follow from that:

  • The where predicates can be translated into "pushed filters" and executed on the Cassandra side.

  • limit(...) is not translated into a CQL LIMIT; it is applied on the Spark side only after the rows have already been read. In other words:

This is a limitation of the Spark DataSource API: it provides no way to push a limit into the source, so the connector cannot avoid reading the full filtered result.

There are two workarounds:

  • Stay with the DataFrame API and reduce the amount of data read, e.g. by tuning numPartitions and the concurrency options (such as concurrent.reads). If you know roughly what slice of the data corresponds to n ~ 50 rows, add an extra predicate, e.g. where(dayIndex < 50 * factor * num_records).

  • Get a real CQL LIMIT pushed down via SparkPartitionLimit, which appends LIMIT to each CQL query the connector issues, so far fewer rows are read. This is exposed only on CassandraRdd, which means dropping down to the RDD API.

For example:

filteredDataFrame.rdd.asInstanceOf[CassandraRDD[Row]].limit(n).take(n)

This appends LIMIT $N to the generated CQL queries. Unlike the DataFrame API, repeated calls on a CassandraRDD do not compose: .limit(10).limit(20) simply keeps the last value. Also note that the limit is applied per Spark partition, not globally. You could pass n / numPartitions + 1 instead of n, but (because Spark partitions do not map one-to-one onto Cassandra partitions) that may return fewer than n rows. With limit(n) followed by take(n) you read at most numPartitions * n rows and then keep exactly n.
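The trade-off between the two per-partition limits can be sketched as follows. This assumes `filteredDataFrame` is the filtered DataFrame built above and that the cast to `CassandraRDD` succeeds (it requires the underlying RDD to actually be a Cassandra table scan):

```scala
import org.apache.spark.sql.Row
import com.datastax.spark.connector.rdd.CassandraRDD

val n = 50
val numPartitions = filteredDataFrame.rdd.getNumPartitions

// Conservative: each of the numPartitions CQL queries gets LIMIT n,
// so at most numPartitions * n rows cross the wire; take(n) keeps n.
val conservative = filteredDataFrame.rdd
  .asInstanceOf[CassandraRDD[Row]]
  .limit(n)
  .take(n)

// Aggressive: LIMIT (n / numPartitions + 1) per query reads far less data,
// but may return fewer than n rows if the data is skewed across partitions.
val aggressive = filteredDataFrame.rdd
  .asInstanceOf[CassandraRDD[Row]]
  .limit(n / numPartitions + 1)
  .take(n)
```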

Finally, make sure the where predicates are actually translated into CQL (check the plan with explain()) - otherwise the LIMIT is applied to an unfiltered scan.

P.S. You can also try running the same query through sparkSession.sql(...) (if your version supports it) and compare the timings.
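A minimal sketch of that comparison, assuming the same `$Aggregates` and `$KeySpace` variables as in the question; the view name `aggregates` is made up for illustration:

```scala
// Expose the Cassandra table to Spark SQL, then run the same query
// as SQL and compare its plan/timing with the DataFrame version.
sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> s"$Aggregates", "keyspace" -> s"$KeySpace"))
  .load()
  .createOrReplaceTempView("aggregates")

sparkSession.sql(
  """SELECT * FROM aggregates
     WHERE aggregate_type = 'DAY' AND start_time <= '2018-03-28'
     LIMIT 50""").show()
```

The SQL LIMIT is subject to the same DataSource API restriction, so it is not pushed to Cassandra either; the comparison mainly shows whether the filters are pushed identically in both APIs.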


Source: https://habr.com/ru/post/1695469/

