How to convert an RDD[String] to RDD[Vector] in Spark?
I have a file where each line is in this way
info1,info2
info3,info4
...
After scanning, I want to run the k-means algorithm:
val rawData = sc.textFile(myFile)
val converted = convertToVector(rawData)
val kmeans = new KMeans()
kmeans.setK(10)
kmeans.setRuns(10)
kmeans.setEpsilon(1.0e-6)
val model = kmeans.run(rawData) // problem: KMeans.run accepts only RDD[Vector]
Since KMeans only accepts RDD[Vector], I wrote a function to convert my RDD[String] rawData to RDD[Vector]. But I am stuck on how to do this; the function below is a work in progress:
def convertToVector(rawData: RDD[String]): RDD[Vector] = {
  // TODO: this produces RDD[Vector[String]], not RDD[Vector]
  val map = rawData.map { line =>
    line.split(",").toVector
  }
  map
}
Any suggestions on how to achieve this?
Thanks in advance.
1 answer
This is a simple operation, since each line of your input file already represents a hypothetical vector as comma-separated values.
You just need to map over each line, split it on the separator, and build a dense vector from the parsed numbers:
val parsedData = rawData.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
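Putting it together, here is a minimal self-contained sketch of the conversion, assuming Spark's RDD-based MLlib API (`org.apache.spark.mllib.linalg.Vectors`); the helper name `parseLine` is just for illustration:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Pure parsing step: one comma-separated line -> Array[Double].
// Kept separate from the RDD so it can be unit-tested without a SparkContext.
def parseLine(line: String): Array[Double] =
  line.split(',').map(_.trim.toDouble)

// The conversion the question asks for: RDD[String] -> RDD[Vector].
def convertToVector(rawData: RDD[String]): RDD[Vector] =
  rawData.map(line => Vectors.dense(parseLine(line)))
```

The result can then be fed directly to the clustering step, e.g. `val model = kmeans.run(convertToVector(rawData))`. Note this assumes every line parses cleanly as doubles; malformed lines would throw a NumberFormatException inside the map.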