(1) the variable a will be saved as an RDD variable containing the expected contents of the txt file
(Emphasis mine.) Not really. The line only describes what will happen after you execute an action; that is, the RDD variable does not contain the expected contents of the txt file.
The RDD describes the partitions that, when an action is called, become tasks that will each read their part of the input file.
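For example, a minimal Scala sketch of that laziness (the path and the map function are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local context, just for this sketch.
val sc = new SparkContext(
  new SparkConf().setAppName("rdd-laziness").setMaster("local[*]"))

// Nothing is read here: `a` is only a description of the file and its partitions.
val a = sc.textFile("hdfs:///data/input.txt")

// Still nothing is read: the transformation just extends that description.
val b = a.map(_.toUpperCase)
```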
(2) The driver node breaks the work down into tasks, and each task contains information about the split of the data it will work on. These tasks are then assigned to worker nodes.
Yes, but only when an action, c = b.collect() in your case, is executed.
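Continuing the sketch above, the planned splits can be inspected without running anything; the job itself is only submitted by the action:

```scala
// No job has run yet; this only looks at the planned splits (partitions).
println(b.getNumPartitions)

// Only here does the driver break the lineage into tasks, one per partition,
// and ship them to the executors on the worker nodes.
val c = b.collect()
```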
(3) When the collect action (collect() in our case) is called, the results will be returned to the master from the different nodes and saved as a local variable c.
YES! That is the most dangerous operation memory-wise, since all the Spark executors running somewhere in the cluster start sending data back to the driver.
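Because of that, when only a sample or a summary is needed, it is common to avoid collect() on a large RDD. A hedged sketch, continuing from above:

```scala
// `c` above now holds the entire dataset in the driver's memory;
// for a large file this can easily exhaust the driver's heap.
// Actions such as take(n) or count() send far less data back to the driver.
val sample: Array[String] = b.take(10)
val total: Long = b.count()
```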
Now I want to understand what difference the code below makes
Quoting the documentation of sc.textFile:
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
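For instance (the path and partition count are illustrative only):

```scala
// Reads from HDFS, a local path, S3, etc.; minPartitions is only a hint for
// the minimum number of splits, and no data is read until an action runs.
val lines: org.apache.spark.rdd.RDD[String] =
  sc.textFile("hdfs:///data/input.txt", minPartitions = 4)
```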
Quoting the documentation of sc.parallelize:
parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T] Distribute a local Scala collection to form an RDD.
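For instance (the collection and slice count are illustrative only):

```scala
// Distributes an in-memory Scala collection across the cluster;
// numSlices controls how many partitions (and hence tasks) are created.
val numbers: org.apache.spark.rdd.RDD[Int] =
  sc.parallelize(1 to 1000, numSlices = 8)
```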
The difference is in the datasets: files (for textFile) versus a local collection (for parallelize). Otherwise they do the same thing under the covers, i.e. they both build a description of how to access the data that will be processed using transformations and an action.
So the main difference is the data source.
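To make that concrete, a small sketch (names are illustrative) where both RDDs go through exactly the same transformation and action:

```scala
val fromFile = sc.textFile("hdfs:///data/words.txt")      // data lives in a file
val fromSeq  = sc.parallelize(Seq("spark", "rdd", "lazy")) // data lives on the driver

// From here on, both are just RDD[String]: same transformations, same actions,
// and neither does any work until collect() (or another action) is called.
val upperFile = fromFile.map(_.toUpperCase).collect()
val upperSeq  = fromSeq.map(_.toUpperCase).collect()
```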