Apache Spark has three memory regions:

- Cache: where RDDs are stored when you call `cache` or `persist`.
- Shuffle: memory used for shuffle operations (grouping, repartitioning, `reduceByKey`).
- Heap: where ordinary JVM objects are stored.
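For context, here is a minimal sketch of the settings that size these regions, assuming the unified memory manager in Spark 1.6+ (older versions use `spark.storage.memoryFraction` and `spark.shuffle.memoryFraction` instead); the values are illustrative, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("memory-regions-sketch")
  .config("spark.executor.memory", "100g")        // total JVM heap per executor
  .config("spark.memory.fraction", "0.6")         // share of the heap (minus a small reserve)
                                                  // split between execution (shuffle) and storage (cache)
  .config("spark.memory.storageFraction", "0.5")  // part of that share protected for cached RDDs
  .getOrCreate()
```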
Now I would like to monitor how much memory my job uses in each region, as a percentage of that region's size, so that I know how to tune these settings so that the cache and shuffle do not spill to disk and the heap does not OOM. For instance, every few seconds I would get an update like:
Cache: 40% used (40/100 GB)
Shuffle: 90% used (45/50 GB)
Heap: 10% used (1/10 GB)
I know I could find sweet spots by trial and error using other methods, but I find that very difficult; simply being able to watch the usage would greatly simplify writing and tuning Spark jobs.
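To make the idea concrete, here is a rough sketch of the kind of polling I have in mind for the cache region only, using the `/executors` endpoint of the driver's monitoring REST API (on the UI port, 4040 by default), which reports `memoryUsed` and `maxMemory` for storage memory. The application id and host are placeholders, the JSON handling is deliberately naive, and shuffle and general heap usage would need a different source (e.g. JVM/GC metrics):

```scala
import scala.io.Source

object StorageMemoryPoller {
  // Regexes are a stand-in for real JSON parsing (e.g. with json4s).
  private val usedRe = """"memoryUsed"\s*:\s*(\d+)""".r
  private val maxRe  = """"maxMemory"\s*:\s*(\d+)""".r

  def main(args: Array[String]): Unit = {
    val appId = "app-00000000000000-0000"  // placeholder application id
    val url   = s"http://localhost:4040/api/v1/applications/$appId/executors"
    while (true) {
      val json  = Source.fromURL(url).mkString
      val used  = usedRe.findAllMatchIn(json).map(_.group(1).toLong).sum
      val total = maxRe.findAllMatchIn(json).map(_.group(1).toLong).sum
      if (total > 0)
        println(f"Cache: ${100.0 * used / total}%.0f%% used ($used / $total bytes)")
      Thread.sleep(5000)  // refresh every few seconds, as in the example above
    }
  }
}
```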