Spark 1.5.2: NaN when calculating stddev

I get NaN when calculating standard deviation (stddev). This is a very simple use case, as described below:

 val df = Seq(("1",19603176695L),("2", 26438904194L),("3",29640527990L),("4",21034972928L),("5", 23975L)).toDF("v","data")

I have stddev defined as a UDF:

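import org.apache.spark.sql.Column
import org.apache.spark.sql.functions.{mean, sqrt}

// Population standard deviation as sqrt(E[x^2] - E[x]^2)
def stddev(col: Column) = {
  sqrt(mean(col * col) - mean(col) * mean(col))
}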

I get NaN when I call the UDF, as shown below:

df.agg(stddev(col("data")).as("stddev")).show() 

This produces the following:

+------+
|stddev|
+------+
|   NaN|
+------+

What am I doing wrong?

1 answer

Given your data, mean(col*col) as well as mean(col)*mean(col) are greater than the maximum value of a Long, so the Long multiplication col*col overflows. You can start by casting the input column to double:

df.agg(stddev(col("data").cast("double")).as("stddev"))

but in general this formula will not be particularly stable numerically for very large numbers.
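
To illustrate both points without a Spark cluster, here is a minimal sketch in plain Scala. It is not part of the original answer: the names StddevSketch, naiveStddev and welfordStddev are made up for illustration, and Welford's single-pass method is swapped in as one commonly used, more stable alternative to the sqrt(E[x^2] - E[x]^2) formula. It assumes you want the population standard deviation, as in the question.

```scala
object StddevSketch {
  // The five values from the question's "data" column.
  val data: Seq[Long] =
    Seq(19603176695L, 26438904194L, 29640527990L, 21034972928L, 23975L)

  // Same formula as in the question, sqrt(E[x^2] - E[x]^2), on Double inputs.
  def naiveStddev(xs: Seq[Double]): Double = {
    val mean = xs.sum / xs.size
    val meanOfSquares = xs.map(x => x * x).sum / xs.size
    math.sqrt(meanOfSquares - mean * mean)
  }

  // Welford's single-pass update: keeps a running mean and the running sum
  // of squared deviations (m2), so two huge squares are never subtracted.
  def welfordStddev(xs: Seq[Double]): Double = {
    var n = 0L
    var mean = 0.0
    var m2 = 0.0
    for (x <- xs) {
      n += 1
      val delta = x - mean
      mean += delta / n
      m2 += delta * (x - mean)
    }
    math.sqrt(m2 / n) // population standard deviation
  }

  def main(args: Array[String]): Unit = {
    val x = data.head
    // Long multiplication wraps around silently when the result overflows.
    println(s"Long square:   ${x * x}")
    println(s"Exact square:  ${BigInt(x) * BigInt(x)}")
    println(s"Long.MaxValue: ${Long.MaxValue}")

    val doubles = data.map(_.toDouble)
    println(s"naive on doubles: ${naiveStddev(doubles)}")
    println(s"welford:          ${welfordStddev(doubles)}")
  }
}
```

For this particular data set the two Double-based results should agree closely. The naive formula tends to lose precision when the mean is very large compared to the spread of the values, which is where a Welford-style update helps.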


Source: https://habr.com/ru/post/1652253/

