I just did it myself. I started writing a function to calculate the median, but found it faster and easier to get quantiles by running my RDD as a DataFrame and querying it using SQL. Here is an example:
// construct example RDD val rows = Seq(3, 1, 5, 1, 9, 2, 2) val rdd = sc.parallelize(rows) // construct Dataframe case class MedianDF(value: Long) val df = rdd.map(row => MedianDF(row.toLong)).toDF // register the table and then query for your desired percentile df.registerTempTable("table") sqlContext.sql("SELECT PERCENTILE(value, 0.5) FROM table").show()
Which returns 2, the median. Similarly, if you want the first quartile to just skip 0.25 to PERCENTILE:
sqlContext.sql("SELECT PERCENTILE(value, 0.25) FROM table").show()
source share