(1) the variable a will be saved as an RDD variable containing the expected contents of the txt file
(Emphasis mine.) Not really. The line only describes what will happen after you execute an action; that is, the RDD variable does not contain the expected contents of the txt file.
The RDD describes the partitions that, when an action is called, become tasks that will each read their part of the input file.
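For example, a minimal Scala sketch of that laziness (the path and the map function are made up for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical local context, just for this sketch.
val sc = new SparkContext(
  new SparkConf().setAppName("rdd-laziness").setMaster("local[*]"))

// Nothing is read here: `a` is only a description of the file and its partitions.
val a = sc.textFile("hdfs:///data/input.txt")

// Still nothing is read: the transformation just extends that description.
val b = a.map(_.toUpperCase)
```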
(2) The driver node breaks the work down into tasks, and each task contains information about the split of the data it will work on. These tasks are then assigned to worker nodes.
Yes, but only when an action, c = b.collect() in your case, is executed.
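Continuing the sketch above, the planned splits can be inspected without running anything; the job itself is only submitted by the action:

```scala
// No job has run yet; this only looks at the planned splits (partitions).
println(b.getNumPartitions)

// Only here does the driver break the lineage into tasks, one per partition,
// and ship them to the executors on the worker nodes.
val c = b.collect()
```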
(3) When the collect action (collect() in our case) is called, the results will be returned to the master from the different nodes and saved as a local variable c.
YES! That is the most dangerous operation memory-wise, since all the Spark executors running somewhere in the cluster start sending data back to the driver.
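Because of that, when only a sample or a summary is needed, it is common to avoid collect() on a large RDD. A hedged sketch, continuing from above:

```scala
// `c` above now holds the entire dataset in the driver's memory;
// for a large file this can easily exhaust the driver's heap.
// Actions such as take(n) or count() send far less data back to the driver.
val sample: Array[String] = b.take(10)
val total: Long = b.count()
```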
Now I want to understand what difference the code below makes
Quoting the documentation of sc.textFile:
textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
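For instance (the path and partition count are illustrative only):

```scala
// Reads from HDFS, a local path, S3, etc.; minPartitions is only a hint for
// the minimum number of splits, and no data is read until an action runs.
val lines: org.apache.spark.rdd.RDD[String] =
  sc.textFile("hdfs:///data/input.txt", minPartitions = 4)
```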
Quoting the documentation of sc.parallelize:
parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T] Distribute a local Scala collection to form an RDD.
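For instance (the collection and slice count are illustrative only):

```scala
// Distributes an in-memory Scala collection across the cluster;
// numSlices controls how many partitions (and hence tasks) are created.
val numbers: org.apache.spark.rdd.RDD[Int] =
  sc.parallelize(1 to 1000, numSlices = 8)
```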
The difference is in the datasets: files (for textFile) versus a local collection (for parallelize). Otherwise they do the same thing under the covers, i.e. they both build a description of how to access the data that will be processed using transformations and an action.
So the main difference is the data source.
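To make that concrete, a small sketch (names are illustrative) where both RDDs go through exactly the same transformation and action:

```scala
val fromFile = sc.textFile("hdfs:///data/words.txt")      // data lives in a file
val fromSeq  = sc.parallelize(Seq("spark", "rdd", "lazy")) // data lives on the driver

// From here on, both are just RDD[String]: same transformations, same actions,
// and neither does any work until collect() (or another action) is called.
val upperFile = fromFile.map(_.toUpperCase).collect()
val upperSeq  = fromSeq.map(_.toUpperCase).collect()
```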