I am creating a Spark application where I need to cache about 15 GB of CSV files. I read about the new UnifiedMemoryManager introduced in Spark 1.6 here:
https://0x0fff.com/spark-memory-management/
The post also includes a diagram of the executor memory layout.
The author differentiates between User Memory and Spark Memory (which is itself divided into Storage and Execution Memory). As I understand it, Spark Memory is flexible between execution (shuffling, sorting, etc.) and storage (caching): if one side needs more memory, it can borrow from the other (as long as that part has not been fully used yet). Is this assumption true?
User memory is described as follows:
User Memory. This is the memory pool that remains after the allocation of Spark Memory, and it is completely up to you to use it in a way you like. You can store your own data structures there that would be used in RDD transformations. For example, you can rewrite Spark aggregation by using the mapPartitions transformation maintaining a hash table for this aggregation to run, which would consume so-called User Memory. [...] And again, this is the User Memory and it is completely up to you what would be stored in this RAM and how, Spark makes completely no accounting on what you do there and whether you respect this boundary or not. Not respecting this boundary in your code might cause an OOM error.
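To make the quoted example concrete, here is a minimal sketch of the per-partition body one would hand to mapPartitions, written as plain Java so the User-Memory aspect is visible (the word-count aggregation and class name are my own illustration, not from the article):

```java
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical sketch: a manual per-partition aggregation.
// The HashMap below is an ordinary on-heap data structure, so it lives in
// what the article calls "User Memory" -- Spark neither tracks nor limits
// it, and a partition with too many distinct keys can OOM the executor.
public class UserMemoryAggregation {
    public static Map<String, Integer> countPerKey(Iterator<String> rows) {
        Map<String, Integer> counts = new HashMap<String, Integer>(); // user memory
        while (rows.hasNext()) {
            String key = rows.next();
            Integer c = counts.get(key);
            counts.put(key, c == null ? 1 : c + 1);
        }
        return counts;
    }
}
```

In a real job this body would run once per partition via rdd.mapPartitions(...), with the map's entries emitted as the partition's output.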
How can I access this part of the memory, or how is it managed by Spark?
And for my purpose, do I mainly need Storage Memory (since I don't do things like shuffling, joining, etc.)? If so, should I set spark.memory.storageFraction to 1.0?
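For reference, these are the knobs involved, shown as illustrative spark-submit flags with the Spark 1.6 defaults as values (the application class and jar name are placeholders). Note that spark.memory.storageFraction only marks the part of Spark Memory that is immune to eviction by execution; it is not a hard cap, and execution can still borrow unused storage memory:

```shell
# Illustrative flags only (Spark 1.6 defaults shown).
# spark.memory.fraction: share of (heap - 300 MB) given to unified Spark Memory.
# spark.memory.storageFraction: part of Spark Memory immune to eviction by
#   execution -- not a hard cap on storage.
spark-submit \
  --conf spark.memory.fraction=0.75 \
  --conf spark.memory.storageFraction=0.5 \
  --class com.example.MyApp myapp.jar
```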
Most importantly: what is the User Memory actually for, especially given the purpose I described above?
Would memory usage differ if I changed the program slightly, e.g., by using my own classes, say an RDD&lt;MyOwnRepresentationClass&gt; instead of an RDD&lt;String&gt;?
Here is my code snippet (which I call many times from the Livy Client in a benchmark application; I am using Spark 1.6.2 with Kryo serialization):
JavaRDD<String> inputRDD = sc.textFile(inputFile);
JavaRDD<String> cachedRDD = inputRDD.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String row) throws Exception {
        String[] parts = row.split(";");
        // The original failure check was elided; as a placeholder,
        // assume the second column carries a status flag.
        boolean hasFailure = parts.length > 1 && "FAILURE".equals(parts[1]);
        return hasFailure;
    }
}).persist(StorageLevel.MEMORY_ONLY_SER());