Spark + Scala transformations, immutability and memory overhead

I have watched several videos on YouTube about Spark architecture.

Although lazy evaluation, resilience of the data (re-creation in the event of failures), and good functional programming concepts are the reasons for the success of Resilient Distributed Datasets, one worrying factor is the memory overhead caused by multiple transformations, which can lead to unforeseen memory costs.

If I understand the concept correctly, each transformation creates a new data set, so the memory requirements multiply accordingly. If I use 10 transformations in my code, 10 data sets will be created and memory consumption will increase 10-fold.

e.g.

val textFile = sc.textFile("hdfs://...")
val counts = textFile.flatMap(line => line.split(" "))
                 .map(word => (word, 1))
                 .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs://...")

The above example has three transformations: flatMap, map and reduceByKey. Does this mean that I need 3X the memory for data of size X?

Is my understanding correct? Is caching the RDD the only way to address this problem?

Once I start caching, the data may spill over to disk due to its size, and performance will then depend on disk I/O. In that case, would the performance of Hadoop and Spark be comparable?
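For instance, I imagine the caching would have to be requested with an explicit storage level along these lines (just a sketch of my example above, with a hypothetical cachedCounts value; MEMORY_AND_DISK is the level that spills partitions that do not fit in memory):

import org.apache.spark.storage.StorageLevel

// Sketch: persist with a storage level that spills partitions that do not fit
// in memory to local disk instead of recomputing them.
val cachedCounts = textFile.flatMap(line => line.split(" "))
                           .map(word => (word, 1))
                           .reduceByKey(_ + _)
                           .persist(StorageLevel.MEMORY_AND_DISK)

cachedCounts.saveAsTextFile("hdfs://...")  // first action computes and populates the cache
println(cachedCounts.count())              // reuses cached partitions (memory or disk)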

EDIT:

From the answer and the comments I now understand lazy evaluation and the pipelined execution, so my assumption of 3X memory, where X is the size of the initial RDD, is not accurate.

But is it then possible to keep just 1X of the RDD in memory and reuse it along the pipeline? How does cache() work?
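In other words, would something like the following keep a single copy around and reuse it across actions? (A sketch based on the textFile RDD from my example; cache() stores at the default MEMORY_ONLY level.)

// Sketch: cache() marks the RDD for storage at the default MEMORY_ONLY level;
// partitions are actually stored the first time an action computes them.
val words = textFile.flatMap(line => line.split(" ")).cache()

println(words.count())             // computes the pipeline and fills the cache
println(words.distinct().count())  // reuses the cached partitions instead of rereading HDFS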

+4
2

First of all, lazy execution means that function composition can occur, so the transformations are fused into a single pass over the data:

scala> val rdd = sc.makeRDD(List("This is a test", "This is another test", 
                                 "And yet another test"), 1)
rdd: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[70] at makeRDD at <console>:27

scala> val counts = rdd.flatMap(line => {println(line);line.split(" ")}).
     | map(word => {println(word);(word,1)}).
     | reduceByKey((x,y) => {println(s"$x+$y");x+y}).
     | collect
This is a test
This
is
a
test
This is another test
This
1+1
is
1+1
another
test
1+1
And yet another test
And
yet
another
1+1
test
2+1
counts: Array[(String, Int)] = Array((And,1), (is,2), (another,2), (a,1), (This,2), (yet,1), (test,3))

Note that I force the parallelism down to 1 so that we can see what this looks like on a single worker, and I add a println to each transformation so that we can follow the flow. You can see that each line is processed, its words are mapped and then reduced, all in sequence, before the next line is touched; no full intermediate data set is materialized for each transformation. So Spark does not allocate new memory for every step of the chain: the data streams through the transformations until a shuffle is needed, i.e. up to a stage boundary in the DAG:

[DAG visualization]
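If you want to see the stage boundary for yourself, one way (sketched here on the rdd from the session above, without the final collect) is toDebugString, which prints the lineage:

// Build the same pipeline without the final collect, then inspect its lineage.
val wordCounts = rdd.flatMap(line => line.split(" "))
                    .map(word => (word, 1))
                    .reduceByKey(_ + _)

// The indented block in the output marks the shuffle (stage) boundary introduced
// by reduceByKey; everything above it runs pipelined in a single stage.
println(wordCounts.toDebugString)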

Second, a Spark vs Hadoop comparison is not really worth making any more (google it): Spark comes out ahead for most workloads, because intermediate results can be kept in memory, execution is pipelined, and further optimizations are available through the DataFrames API.
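As an illustration only (not part of the original discussion), the same word count through the DataFrames route might look like this, assuming the spark SparkSession provided by spark-shell:

import org.apache.spark.sql.functions.{col, explode, split}

// Sketch: the same word count with the DataFrame API, which is planned by
// the Catalyst optimizer rather than as a hand-built RDD pipeline.
val lines = spark.read.textFile("hdfs://...")   // Dataset[String] with a "value" column
val dfCounts = lines
  .select(explode(split(col("value"), " ")).as("word"))
  .groupBy("word")
  .count()

dfCounts.show()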

In short, once you have worked with Spark for a while, there is little reason to keep reasoning in classic MR terms.

+8

Spark does not create 10 copies of your data just because you chain 10 transformations. When an action is called, Spark builds a DAG from the transformations and evaluates it lazily, pipelining the steps wherever it can. Until that action is triggered nothing is computed, and Spark only materializes intermediate data when it actually has to.

So when you finally call an action on the RDD, Spark executes that DAG. Intermediate results are not kept unless you cache them, which means that even a chain of 10 transformations does not need 10X the memory of the input (aside from what is required at shuffle boundaries and for internal buffers).
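A minimal sketch of that laziness, assuming the sc SparkContext from a spark-shell session:

// Sketch: the three transformations below only build the DAG; nothing runs yet.
val numbers  = sc.parallelize(1 to 1000000)
val pipeline = numbers.map(_ * 2)
                      .filter(_ % 3 == 0)
                      .map(_ + 1)

// Only the action triggers execution, and elements stream through the whole
// chain partition by partition rather than as three full intermediate copies.
println(pipeline.count())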

+1

Source: https://habr.com/ru/post/1626957/

