In this simple case of DataFrame.select vs DataFrame.rdd.map, I think the difference will be almost negligible.
After all, you have already loaded your dataset and are only doing a projection. In the end, both have to deserialize the data from Spark's columnar InternalRow format to compute the result for the action.
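For concreteness, here is a minimal sketch of the two variants (my own illustration, not from the question; it assumes a spark-shell session where spark is already defined):

import spark.implicits._  // auto-imported in spark-shell; needed for the 'id syntax

val df = spark.range(5).toDF("id")   // a DataFrame, i.e. Dataset[Row]

// Variant 1: stays inside the optimized query pipeline
val viaSelect = df.select('id)
viaSelect.collect()

// Variant 2: drops down to an RDD[Row] and deserializes every row
val viaRddMap = df.rdd.map(_.getLong(0))
viaRddMap.collect()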
You can check what happens with DataFrame.select using explain(extended = true), which shows the logical and physical plans:
scala> spark.version
res4: String = 2.1.0-SNAPSHOT

scala> spark.range(5).select('id).explain(extended = true)
== Parsed Logical Plan ==
'Project [unresolvedalias('id, None)]
+- Range (0, 5, step=1, splits=Some(4))

== Analyzed Logical Plan ==
id: bigint
Project [id#17L]
+- Range (0, 5, step=1, splits=Some(4))

== Optimized Logical Plan ==
Range (0, 5, step=1, splits=Some(4))

== Physical Plan ==
*Range (0, 5, step=1, splits=Some(4))
Compare that physical plan (i.e. the SparkPlan) with what rdd.map gives you (via toDebugString) and you will see which one might be "better":
scala> spark.range(5).rdd.toDebugString
res5: String = (4) MapPartitionsRDD[8] at rdd at <console>:24 []
 |  MapPartitionsRDD[7] at rdd at <console>:24 []
 |  MapPartitionsRDD[6] at rdd at <console>:24 []
 |  MapPartitionsRDD[5] at rdd at <console>:24 []
 |  ParallelCollectionRDD[4] at rdd at <console>:24 []
(Again, in this contrived example I think there is no winner: both are as efficient as they can be.)
Note that a DataFrame is really a Dataset[Row], which uses RowEncoder to encode (i.e. serialize) the data into the binary columnar InternalRow format. If you were to execute more operators in the pipeline, you could get much better performance by sticking with Dataset rather than RDD, simply because of the low-level behind-the-scenes optimizations of logical query plans and the binary columnar format.
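As a rough sketch (my own illustration, not from the original answer) of what "more operators in the pipeline" might look like: the Dataset version below gives Catalyst a logical plan to optimize, e.g. collapsing the two adjacent projections into one, while the RDD version is a chain of opaque Scala closures that Spark runs exactly as written:

import spark.implicits._

val df = spark.range(100).toDF("id")

// Dataset pipeline: Catalyst sees the whole plan and can, for example,
// merge the two select operators into a single Project before execution.
val optimized = df
  .select('id, ('id * 2) as "doubled")
  .select('doubled)
  .filter('doubled > 10)

// RDD pipeline: each closure is a black box to Spark,
// so no plan-level rewriting is possible.
val manual = df.rdd
  .map(row => row.getLong(0) * 2)
  .filter(_ > 10)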
There are many such optimizations, and trying to beat them can often be a waste of time. You would have to know the internals of Spark by heart to get better performance (and the price would certainly be readability).
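If you want to see one of these optimizations in action (again a hypothetical snippet of mine), running explain on a pipeline with redundant projections typically shows them collapsed in the Optimized Logical Plan:

import spark.implicits._

spark.range(100).toDF("id")
  .select('id, ('id * 2) as "doubled")
  .select('doubled)
  .explain(extended = true)
// In the Optimized Logical Plan the two selects should appear as a
// single Project, thanks to Catalyst's CollapseProject rule.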
There is a lot to them, and I highly recommend watching the Deep Dive into Spark SQL's Catalyst Optimizer talk by Herman van Hovell to learn about and appreciate all the optimizations.
I think the takeaway is... "Stay away from RDDs unless you know what you are doing."