DataFrames are a fairly new concept for me. Several sources recommend them over RDDs and claim they outperform RDDs in many situations. I would like to see whether DataFrames are a viable option (in the end I will be dealing with an array of bytes), so I compared the performance of a Word Count application on a 1 GB file.
In any case, when I ran the program, I got the following results:
RDD Total: 137733312 Elapsed time: 44.5675378 s
DF Total: 137733312 Elapsed time: 69.201253448 s
I expected the DataFrame version to run faster than the RDD one. Is this the result of a poor implementation on my part? Or is it because my DataFrame implementation goes through textFile, so the data is first loaded into an RDD and only then converted to a DataFrame, and that conversion hurts performance? Would it be better to convert my file to Parquet (since that is the default data source) and load it directly from there?
I was wondering if anyone could explain why the RDD version outperforms the DataFrame version by such a significant margin.
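For reference, this is roughly what I mean by skipping the RDD step and by the Parquet variant. It is only a sketch based on Spark 1.6's DataFrameReader/DataFrameWriter API; parquetPath is just a placeholder location, and I have not benchmarked either of these:

import org.apache.spark.sql.{DataFrame, SQLContext}

def loadLinesDirectly(sqlContext: SQLContext, fname: String): DataFrame = {
  // read the text file straight into a DataFrame (one "value" column),
  // without going through an RDD and toDF
  sqlContext.read.text(fname).toDF("line")
}

def loadViaParquet(sqlContext: SQLContext, fname: String, parquetPath: String): DataFrame = {
  // one-time conversion of the text file to Parquet, then load from the Parquet files
  sqlContext.read.text(fname).toDF("line").write.parquet(parquetPath)
  sqlContext.read.parquet(parquetPath)
}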
import org.apache.spark.SparkContext
import org.apache.spark.ml.feature.Tokenizer
import org.apache.spark.sql.SQLContext
import org.apache.spark.sql.functions.{sum, udf}
import scala.collection.mutable.WrappedArray

def testDF(sc: SparkContext, sqlContext: SQLContext,
           fname: String, threshold: Int): Long = {
  import sqlContext.implicits._
  // load the text file as an RDD and convert it to a one-column DataFrame
  val linesDF = sc.textFile(fname).toDF("line")
  // split each line into an array of words
  val tokenizer = new Tokenizer().setInputCol("line").setOutputCol("words")
  val wordsDF = tokenizer.transform(linesDF)
  // count the words in each line, then sum the per-line counts
  val countUDF = udf((data: WrappedArray[String]) => data.size)
  val countTotal = wordsDF.withColumn("count", countUDF('words)).agg(sum("count"))
  countTotal.first()(0).asInstanceOf[Long]
}
def testRDD(sc: SparkContext, fname: String): Int = {
  // split each document into words
  val tokenized = sc.textFile(fname).flatMap(_.split(" "))
  // count the occurrences of each word
  val wordCounts = tokenized.map((_, 1)).reduceByKey(_ + _)
  // sum the per-word counts to get the total number of words
  val countTotal: Int = wordCounts.map(_._2).reduce((a, b) => a + b)
  countTotal
}
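In case the measurement itself matters, here is a minimal sketch of a driver that times both functions. This is a generic nanoTime-based harness rather than necessarily what produced the numbers above; the app name, the threshold value, and reading the file path from args are placeholders, and testRDD/testDF are assumed to be in scope (e.g. defined in the same object):

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object WordCountBenchmark {
  // run a block and return its result together with the elapsed time in seconds
  def time[T](body: => T): (T, Double) = {
    val start = System.nanoTime()
    val result = body
    (result, (System.nanoTime() - start) / 1e9)
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("wordcount-benchmark"))
    val sqlContext = new SQLContext(sc)
    val fname = args(0) // path to the 1 GB text file

    val (rddTotal, rddSecs) = time(testRDD(sc, fname))
    println(s"RDD Total: $rddTotal Elapsed time: $rddSecs s")

    val (dfTotal, dfSecs) = time(testDF(sc, sqlContext, fname, threshold = 0))
    println(s"DF Total: $dfTotal Elapsed time: $dfSecs s")

    sc.stop()
  }
}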