How does the limit clause behave for Cassandra when using DataFrames?

I have a large Cassandra table. I want to fetch only 50 rows from Cassandra with the following code:

val ds = sparkSession.read
      .format("org.apache.spark.sql.cassandra")
      .options(Map("table" -> s"$Aggregates", "keyspace" -> s"$KeySpace"))
      .load()
      .where(col("aggregate_type") === "DAY")
      .where(col("start_time") <= "2018-03-28")
      .limit(50).collect()

This code pushes both where predicates down to Cassandra, but not the limit. Is it true that all the data (about 1 million records) is fetched? If not, why do this code and the same code without limit(50) run in approximately the same time?
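One way to see what actually reaches Cassandra is to inspect the physical plan before collecting. This is a sketch, assuming the same `sparkSession`, `$Aggregates` and `$KeySpace` as above:

```scala
// Build the query but don't collect; then inspect the plan.
// Pushed predicates appear under "PushedFilters", while limit(50)
// shows up only as a Spark-side CollectLimit/GlobalLimit node,
// i.e. it is not part of the Cassandra scan itself.
val ds = sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> s"$Aggregates", "keyspace" -> s"$KeySpace"))
  .load()
  .where(col("aggregate_type") === "DAY")
  .where(col("start_time") <= "2018-03-28")
  .limit(50)

ds.explain(true)  // look for: PushedFilters: [EqualTo(aggregate_type,DAY), ...]
```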

1 answer

Unlike Spark Streaming, Spark works on whole datasets rather than incremental batches, so it plans the read without knowing how many rows you will eventually keep. Two things follow from that:

  • The where predicates can be translated into "pushed filters" and executed on the Cassandra side.

  • limit(...) is not translated into a CQL LIMIT; it is applied on the Spark side only after the rows have already been read. In other words:

This is a limitation of the Spark DataSource API: it provides no way to push a limit into the source, so the connector cannot avoid reading the full filtered result.

There are two workarounds:

  • Stay with the DataFrame API and reduce the amount of data read, e.g. by tuning numPartitions and the concurrency options (such as concurrent.reads). If you know roughly what slice of the data corresponds to n ~ 50 rows, add an extra predicate, e.g. where(dayIndex < 50 * factor * num_records).

  • Get a real CQL LIMIT pushed down via SparkPartitionLimit, which appends LIMIT to each CQL query the connector issues, so far fewer rows are read. This is exposed only on CassandraRdd, which means dropping down to the RDD API.

For example:

filteredDataFrame.rdd.asInstanceOf[CassandraRDD[Row]].limit(n).take(n)

This appends LIMIT $N to the generated CQL queries. Unlike the DataFrame API, repeated calls on a CassandraRDD do not compose: .limit(10).limit(20) simply keeps the last value. Also note that the limit is applied per Spark partition, not globally. You could pass n / numPartitions + 1 instead of n, but (because Spark partitions do not map one-to-one onto Cassandra partitions) that may return fewer than n rows. With limit(n) followed by take(n) you read at most numPartitions * n rows and then keep exactly n.
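The trade-off between the two per-partition limits can be sketched as follows. This assumes `filteredDataFrame` is the filtered DataFrame built above and that the cast to `CassandraRDD` succeeds (it requires the underlying RDD to actually be a Cassandra table scan):

```scala
import org.apache.spark.sql.Row
import com.datastax.spark.connector.rdd.CassandraRDD

val n = 50
val numPartitions = filteredDataFrame.rdd.getNumPartitions

// Conservative: each of the numPartitions CQL queries gets LIMIT n,
// so at most numPartitions * n rows cross the wire; take(n) keeps n.
val conservative = filteredDataFrame.rdd
  .asInstanceOf[CassandraRDD[Row]]
  .limit(n)
  .take(n)

// Aggressive: LIMIT (n / numPartitions + 1) per query reads far less data,
// but may return fewer than n rows if the data is skewed across partitions.
val aggressive = filteredDataFrame.rdd
  .asInstanceOf[CassandraRDD[Row]]
  .limit(n / numPartitions + 1)
  .take(n)
```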

Finally, make sure the where predicates are actually translated into CQL (check the plan with explain()) - otherwise the LIMIT is applied to an unfiltered scan.

P.S. You can also try running the same query through sparkSession.sql(...) (if your version supports it) and compare the timings.
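A minimal sketch of that comparison, assuming the same `$Aggregates` and `$KeySpace` variables as in the question; the view name `aggregates` is made up for illustration:

```scala
// Expose the Cassandra table to Spark SQL, then run the same query
// as SQL and compare its plan/timing with the DataFrame version.
sparkSession.read
  .format("org.apache.spark.sql.cassandra")
  .options(Map("table" -> s"$Aggregates", "keyspace" -> s"$KeySpace"))
  .load()
  .createOrReplaceTempView("aggregates")

sparkSession.sql(
  """SELECT * FROM aggregates
     WHERE aggregate_type = 'DAY' AND start_time <= '2018-03-28'
     LIMIT 50""").show()
```

The SQL LIMIT is subject to the same DataSource API restriction, so it is not pushed to Cassandra either; the comparison mainly shows whether the filters are pushed identically in both APIs.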


Source: https://habr.com/ru/post/1695469/

