I've been exploring query optimization in recent builds of Spark SQL 2.3.0-SNAPSHOT and noticed that semantically identical queries produce different physical plans.
Suppose I need to count the number of rows in the following dataset:
val q = spark.range(1)
I could count the number of rows as follows:
q.count
q.collect.size
q.rdd.count
q.queryExecution.toRdd.count
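For reference, the four variants return quite different things (an annotated sketch; the local SparkSession setup is my own addition for context):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val q = spark.range(1) // Dataset[java.lang.Long]

q.count // Long, computed by Spark SQL as an aggregate
q.collect.size // materializes Array[java.lang.Long] on the driver, then takes its size
q.rdd.count // converts to RDD[java.lang.Long] first, deserializing every row
q.queryExecution.toRdd.count // counts the underlying RDD[InternalRow] directly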
My initial thought was that this would be an almost constant-time operation (surely, given the local dataset) that Spark SQL would somehow optimize to give a result immediately, especially the first variant, where Spark SQL has full control over query execution.
Looking at the physical query plans, I thought that the most efficient query would be the last:
q.queryExecution.toRdd.count
The reasons being:
- It avoids deserializing rows from their binary InternalRow format.
- The query is code-generated.

Both points can be checked from the shell, as sketched below.
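A sketch of those checks (debugCodegen comes from the org.apache.spark.sql.execution.debug package; the exact plan markers vary by build, and the val name is mine):

// toRdd exposes the binary rows directly, with no decoding to java.lang.Long:
val internal: org.apache.spark.rdd.RDD[org.apache.spark.sql.catalyst.InternalRow] =
  q.queryExecution.toRdd

// Operators prefixed with * in the plan run inside WholeStageCodegen:
q.explain()

// Or dump the generated Java code itself:
import org.apache.spark.sql.execution.debug._
q.debugCodegen()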

Is my reasoning correct? If so, would the answer change if the dataset were read from an external data source (e.g. files, JDBC, Kafka)?
The main question is: what factors should be taken into account when judging whether one query is more efficient than another (as in this example)?
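One crude empirical check is to time the variants with SparkSession.time (a sketch; the larger range size is my choice, just so the differences become measurable):

val big = spark.range(100000000L)

spark.time(big.count) // optimized aggregate
spark.time(big.rdd.count) // pays per-row deserialization
spark.time(big.queryExecution.toRdd.count) // raw InternalRow count
// big.collect.size is omitted: it would pull all rows onto the driver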
For completeness, the other variants:
q.count

q.collect.size

q.rdd.count

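Each of those plans can be printed directly with explain (a sketch; the groupBy().count() form approximates what q.count executes under the covers, since Dataset.count is implemented as a global aggregate):

q.groupBy().count().explain() // the aggregate behind q.count
q.explain() // the plain scan behind q.collect.size
println(q.rdd.toDebugString) // RDD lineage behind q.rdd.count, including the deserializing steps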