Equivalent of getLines for an Apache Spark RDD

I have a Scala program that works fine on a single computer. However, I would like to make it work with multiple nodes.

The start of the program is as follows:

    import scala.io.Source

    val filename = Source.fromFile("file://...")
    val lines = filename.getLines
    val linesArray = lines.map(x => x.split(" ").slice(0, 3))
    val mapAsStrings = linesArray.toList.groupBy(_(0)).mapValues(x => x.map(_.tail))
    val mappedUsers = mapAsStrings map {case (k, v) => k -> v.map(x => x(0) -> x(1).toInt).toMap}

When I try to use Spark to run the program, I know that I need a SparkContext and a SparkConf, which are used to create the RDD.

So now I have:

    import org.apache.spark.{SparkConf, SparkContext}

    class myApp(filePath: String) {
      private val conf = new SparkConf().setAppName("myApp")
      private val sc = new SparkContext(conf)
      private val inputData = sc.textFile(filePath)
      // ...
    }

inputData is now an RDD; its equivalent in the previous program was filename (I guess). But the RDD's methods are different. So what is the equivalent of getLines? Or is there no equivalent? I find it difficult to understand what an RDD gives me. For example, is inputData an Array[String] or something else?

thanks

+5
2 answers

An RDD is a distributed collection, so conceptually it is not very different from a List, Array, or Seq: it provides functional operations that let you transform the collection of elements. The main difference from the Scala collections is that an RDD is internally distributed. Given a Spark cluster, when an RDD is created, the collection it represents is partitioned across the nodes of that cluster.
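To make that concrete, here is a minimal sketch (assuming a local master for experimentation; all names are illustrative). It builds an RDD from a local List and shows that it is the same data, just split into partitions:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf().setAppName("rdd-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // A local Scala collection...
    val local = List(1, 2, 3, 4, 5)

    // ...and its distributed counterpart, split into 2 partitions.
    val distributed = sc.parallelize(local, numSlices = 2)

    // getNumPartitions exists on recent Spark versions;
    // on older ones use distributed.partitions.size instead.
    println(distributed.getNumPartitions)            // 2
    println(distributed.map(_ * 2).collect().toList) // List(2, 4, 6, 8, 10)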

sc.textFile(...) returns an RDD[String]. Given a distributed file system, each worker loads a part of that file into a "partition", where further transformations and actions (in Spark lingo) can be performed.
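As a hedged illustration (the path is invented, and it reuses the sc from the sketch above): textFile is lazy, so the map below is a transformation that only describes a new RDD, and nothing is read until an action such as take or count runs:

    // Hypothetical path; minPartitions is a hint, not a guarantee.
    val rdd = sc.textFile("hdfs:///data/input.txt", minPartitions = 4)

    val firstWords = rdd.map(_.split(" ")(0)) // transformation: lazy
    firstWords.take(2).foreach(println)       // action: triggers the read
    println(rdd.count())                      // action: number of lines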

Since the Spark API closely resembles the Scala collections API, once you have the RDD, applying functional transformations to it is very similar to what you would do with a Scala collection.
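A hedged side-by-side sketch of that resemblance (the sample data is invented; sc as above) — the same pipeline over a List and over an RDD, where only the source of the data differs:

    val raw = List("alice 3", "bob 5", "alice 7")

    // Scala collection
    val localPairs = raw.map(_.split(" ")).map(a => a(0) -> a(1).toInt)

    // Same pipeline on an RDD
    val rddPairs = sc.parallelize(raw).map(_.split(" ")).map(a => a(0) -> a(1).toInt)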

Therefore, your Scala program can be easily ported to Spark:

    //val filename = Source.fromFile("file://...")
    //val lines = filename.getLines
    val rdd = sc.textFile("file://...")

    //val linesArray = lines.map(x => x.split(" ").slice(0, 3))
    val lines = rdd.map(x => x.split(" ").slice(0, 3))

    //val mapAsStrings = linesArray.toList.groupBy(_(0)).mapValues(x => x.map(_.tail))
    val mappedLines = lines.groupBy(_(0)).mapValues(x => x.map(_.tail))

    //val mappedUsers = mapAsStrings map {case (k,v) => k -> v.map(x => x(0) -> x(1).toInt).toMap}
    val mappedUsers = mappedLines.mapValues{v => v.map(x => x(0) -> x(1).toInt).toMap}

An important difference is that there is no associative Map equivalent among RDDs. Therefore, mappedUsers is a distributed collection of tuples: an RDD[(String, Map[String, Int])].
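If you do need a plain Map on the driver, one option (a sketch, only safe when the result fits in driver memory) is to collect the pair RDD with collectAsMap:

    // mappedUsers: RDD[(String, Map[String, Int])]
    // collectAsMap pulls everything back to the driver.
    val asLocalMap: Map[String, Map[String, Int]] = mappedUsers.collectAsMap().toMap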

+2

The documentation seems to answer this directly:

 def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] 

Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.

So textFile is the equivalent of both fromFile and getLines, and returns an RDD where each entry is a line from the file. inputData is the equivalent of lines.
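A quick hedged sketch of the correspondence (filePath is whatever you pass in; sc as in the question):

    // Single machine:
    //   Source.fromFile(...).getLines   => Iterator[String], one element per line
    // Spark:
    val inputData = sc.textFile(filePath) // RDD[String], one element per line
    inputData.take(3).foreach(println)    // peek at the first three lines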

+2


