An RDD is a distributed collection, so conceptually it is not very different from a List, Array or Seq: it provides functional operations that let you transform a collection of elements. The main difference from the Scala collections is that an RDD is distributed by nature: given a Spark cluster, when an RDD is created, the collection it represents is partitioned across the nodes of that cluster.
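To make that concrete, here is a minimal sketch, assuming an already-running SparkContext named sc (as provided by the spark-shell):

// Distribute a local collection across the cluster in 4 partitions.
val data = 1 to 1000
val distributed = sc.parallelize(data, numSlices = 4)
distributed.getNumPartitions // 4: each partition lives on some node of the cluster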
sc.textFile(...) returns an RDD[String]. Given a distributed file system, each worker loads a part of that file into a "partition", on which further transformations and actions (in Spark lingo) can be performed.
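For example (a sketch; sc is the SparkContext and the path is hypothetical):

import org.apache.spark.rdd.RDD

// Each element is one line of the file; each worker reads only the
// partitions assigned to it.
val lines: RDD[String] = sc.textFile("hdfs:///data/users.txt")
lines.take(5).foreach(println) // `take` is an action: it triggers the actual read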
Given that the Spark API closely resembles the Scala collections API, once you have an RDD, applying functional transformations to it is very similar to what you would do with a Scala collection.
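A quick side-by-side sketch (again assuming sc is in scope); only the source collection changes, the transformation is written identically:

val list = List("alice 10", "bob 20")
val localSplit = list.map(_.split(" "))                 // List[Array[String]]
val rddSplit = sc.parallelize(list).map(_.split(" "))   // RDD[Array[String]]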
Therefore, your Scala program can be easily ported to Spark:
// Commented-out lines: the original Scala-collections version;
// the line below each one is the Spark equivalent.

//val filename = Source.fromFile("file://...")
//val lines = filename.getLines
val rdd = sc.textFile("file://...")

//val linesArray = lines.map(x => x.split(" ").slice(0, 3))
val lines = rdd.map(x => x.split(" ").slice(0, 3))

//val mapAsStrings = linesArray.toList.groupBy(_(0)).mapValues(x => x.map(_.tail))
val mappedLines = lines.groupBy(_(0)).mapValues(x => x.map(_.tail))

//val mappedUsers = mapAsStrings map {case (k,v) => k -> v.map(x => x(0) -> x(1).toInt).toMap}
val mappedUsers = mappedLines.mapValues{v => v.map(x => x(0) -> x(1).toInt).toMap}
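One caveat the snippet does not show: all of these RDD operations are transformations, so they are lazy; Spark only computes them when an action is invoked. A sketch:

// Nothing has executed yet; `collect` is an action that runs the whole
// pipeline and brings the result back to the driver.
val result = mappedUsers.collect() // Array[(String, Map[String, Int])]
result.foreach(println)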
An important difference is that there is no associative Map collection among the RDDs. Therefore, mappedUsers is a collection of tuples: an RDD[(String, Map[String, Int])].
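If you do need an actual Map on the driver side, one option, shown here as a sketch and only safe when the full result fits in the driver's memory, is collectAsMap:

// Bring the pair RDD back to the driver as a scala.collection.Map.
val usersByKey: scala.collection.Map[String, Map[String, Int]] =
  mappedUsers.collectAsMap()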