Calculating the first quartile of a numerical column in Spark

I am new to Spark / Scala. This is what I do to calculate the first quartile of a CSV file:

    val column = sc.textFile("test.txt")
      .map(_.split(",")(2))
      .map(_.toDouble)
    val total = column.count.toDouble
    val upper = (total + 1) / 4
    val upper2 = scala.math.ceil(upper).toInt

I'm not sure how to sort a column other than by turning it into key-value pairs. All I need is to take the two values around the quartile position after sorting, but to use sortByKey I have to create key-value pairs first:

    val quartiles = column.map((_, 1)).sortByKey(true).take(upper2)
    var first_quartile = 0.0
    if (upper % 1 > 0) {
      first_quartile = quartiles(upper.toInt - 1)._1
    } else {
      first_quartile = (quartiles(upper2 - 1)._1 + quartiles(upper2 - 2)._1) / 2
    }
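The index arithmetic is easy to get wrong, so it helps to sanity-check it with plain Scala collections, no Spark needed. `firstQuartile` below is a hypothetical helper, not code from the question; it uses the common (n + 1) / 4 rank convention with linear interpolation rather than the exact branch structure above:

```scala
object QuartileSketch {
  // First quartile via the (n + 1) / 4 rank convention, with
  // linear interpolation between the two neighbouring values
  // when the rank falls between elements.
  def firstQuartile(values: Seq[Double]): Double = {
    require(values.size >= 3, "need at least 3 values")
    val sorted = values.sorted
    val rank = (sorted.length + 1) / 4.0   // 1-based, possibly fractional
    val lo = math.floor(rank).toInt
    val hi = math.ceil(rank).toInt
    if (lo == hi) sorted(lo - 1)           // whole rank: take it directly
    else {                                 // fractional rank: interpolate
      val frac = rank - lo
      sorted(lo - 1) * (1 - frac) + sorted(hi - 1) * frac
    }
  }
}
```

For example, on the seven values 3, 1, 5, 1, 9, 2, 2 the rank is (7 + 1) / 4 = 2, so Q1 is the second-smallest value, 1.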

This works, but it leaves me with an annoying key-value pair. How do I get back to a single column instead of a pair?
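If the pair exists only to make sortByKey work, note that the RDD API also offers `sortBy`, which sorts by a key function while keeping the element type, so no pair is needed at all. A sketch against the same `column` RDD as above (assuming a reasonably recent Spark 1.x or later):

```scala
// sortBy keeps the element type, so take() returns plain Doubles
// rather than (Double, Int) pairs.
val quartiles = column.sortBy(identity).take(upper2)
```

Alternatively, keep the pairs for sorting and strip them afterwards with `.map(_._1)`.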

1 answer

I just did this myself. I started writing a function to compute the median, but found it faster and easier to get quantiles by converting my RDD to a DataFrame and querying it with SQL. Here is an example:

    // construct example RDD
    val rows = Seq(3, 1, 5, 1, 9, 2, 2)
    val rdd = sc.parallelize(rows)

    // construct DataFrame
    case class MedianDF(value: Long)
    val df = rdd.map(row => MedianDF(row.toLong)).toDF

    // register the table, then query for the desired percentile
    df.registerTempTable("table")
    sqlContext.sql("SELECT PERCENTILE(value, 0.5) FROM table").show()

This returns 2, the median. Similarly, if you want the first quartile, just pass 0.25 to PERCENTILE:

    sqlContext.sql("SELECT PERCENTILE(value, 0.25) FROM table").show()
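As a side note, on Spark 2.0 or later the DataFrame API exposes this directly through `DataFrameStatFunctions.approxQuantile`, so no SQL string is needed. A sketch against the same `df` as above (the relative-error argument trades accuracy for speed; 0.0 requests the exact value):

```scala
// Array(0.25) asks for the first quartile; 0.0 means exact computation.
val Array(q1) = df.stat.approxQuantile("value", Array(0.25), 0.0)
```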

Source: https://habr.com/ru/post/971261/

