How to convert RDD[String] to RDD[Vector] in Spark?

I have a file where each line is in this way

info1,info2
info3,info4
...

After scanning, I want to run the k-means algorithm:

  val rawData = sc.textFile(myFile)
  val converted = convertToVector(rawData)
  val kmeans = new KMeans()
  kmeans.setK(10)
  kmeans.setRuns(10)
  kmeans.setEpsilon(1.0e-6)
  val model = kmeans.run(rawData) // problem: KMeans.run accepts only RDD[Vector]

Since k-means only accepts RDD[Vector], I started writing a function that converts my RDD[String] rawData to RDD[Vector], but I am stuck on how to do it. The function below is a work in progress:

def convertToVector(rawData: RDD[String]): RDD[Vector] = {
    //TODO: this produces a Scala Vector[String] per line,
    // not an org.apache.spark.mllib.linalg.Vector, so it does not compile
    rawData.map { line =>
      line.split(",").toVector
    }
  }

Any suggestions on how to achieve this?

Thanks in advance.

1 answer

This is a straightforward operation, given that each line of your input file is a vector represented as a comma-separated string.

You just need to map over each line, split it on the separator, parse the fields to doubles, and build a dense vector from the result:

val parsedData = rawData.map(s => Vectors.dense(s.split(',').map(_.toDouble)))
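Putting it together, here is a minimal sketch of the whole pipeline using the RDD-based MLlib API; the helper `trainModel` and its `path` parameter are illustrative names, not part of the original code, and the caller is assumed to supply a `SparkContext`:

```scala
import org.apache.spark.SparkContext
import org.apache.spark.mllib.clustering.{KMeans, KMeansModel}
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Parse each comma-separated line into an MLlib dense vector.
def convertToVector(rawData: RDD[String]): RDD[Vector] =
  rawData.map(line => Vectors.dense(line.split(',').map(_.toDouble)))

// Hypothetical entry point: load the file, convert, and run k-means
// with the same parameters as in the question.
def trainModel(sc: SparkContext, path: String): KMeansModel = {
  val parsedData = convertToVector(sc.textFile(path)).cache()
  new KMeans()
    .setK(10)
    .setRuns(10)
    .setEpsilon(1.0e-6)
    .run(parsedData)
}
```

Caching `parsedData` matters here because k-means is iterative and will reuse the RDD on every pass. Note that this assumes every field parses cleanly as a double; if the file can contain headers or malformed rows, filter those out before calling `toDouble`.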

Source: https://habr.com/ru/post/1598177/
