I am writing UDAF to apply to a column of a Spark data frame of type Vector (spark.ml.linalg.Vector). I rely on the spark.ml.linalg package, so I don't have to go back and forth between dataframe and RDD.
Inside UDAF, I have to specify the data type for input, buffer, and output schemes:
def inputSchema = new StructType().add("features", new VectorUDT())
def bufferSchema: StructType =
StructType(StructField("list_of_similarities", ArrayType(new VectorUDT(), true), true) :: Nil)
override def dataType: DataType = ArrayType(DoubleType,true)
VectorUDT is what I will use with spark.mllib.linalg.Vector:
https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/ linalg / Vectors.scala
However, when I try to import it from spark.ml instead: import org.apache.spark.ml.linalg.VectorUDT
I get a runtime error (no errors during build):
class VectorUDT in package linalg cannot be accessed in package org.apache.spark.ml.linalg
Is / expected / can you suggest a workaround?
I am using Spark 2.0.0