How can I monitor the usage of Spark's different memory blocks and find out which one is heading towards OOM?

Apache Spark has 3 memory blocks:

  • Cache. This is where RDDs are put when you call cache() or persist().
  • Shuffle. This is the memory block used for shuffle operations (groupByKey, repartition, reduceByKey, and so on).
  • Heap. This is where ordinary JVM objects are stored. (How the relative sizes of these blocks are configured is sketched right after this list.)

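For context, here is a minimal sketch of where the relative sizes of these blocks are set. The property names below are the unified-memory-manager settings of Spark 1.6+ (older releases used spark.storage.memoryFraction and spark.shuffle.memoryFraction instead), and the values are purely illustrative, not recommendations:

    import org.apache.spark.{SparkConf, SparkContext}

    // Illustrative values only; the right split depends on the job.
    val conf = new SparkConf()
      .setAppName("memory-regions-example")
      .set("spark.executor.memory", "10g")         // total JVM heap per executor
      .set("spark.memory.fraction", "0.6")         // share of the heap for cache + shuffle (Spark 1.6+)
      .set("spark.memory.storageFraction", "0.5")  // part of that region protected for cached RDDs

    val sc = new SparkContext(conf)

Roughly speaking, whatever is left outside spark.memory.fraction is what the "Heap" bullet above refers to: ordinary JVM objects created by user code and Spark internals.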
Now I would like to monitor the amount of memory my job uses, as a percentage of each block, so that I know how to configure these numbers so that cache and shuffle do not spill to disk and the heap does not hit OOM. For instance, every few seconds I would get an update like:

Cache: 40% used (40/100 GB)
Shuffle: 90% used (45/50 GB)
Heap: 10% used (1/10 GB)

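As a rough starting point, here is a sketch of polling the cache (storage) block from the driver, assuming a live SparkContext named sc. getExecutorMemoryStatus reports, per executor, the maximum memory available for caching and how much of it is still free; it does not cover the shuffle block or the plain heap, for which the executor page of the web UI or the monitoring REST endpoint /api/v1/applications/<app-id>/executors is closer to what is asked for here.

    import org.apache.spark.SparkContext

    // Prints, per executor, how much of the storage (cache) block is in use.
    // getExecutorMemoryStatus returns a Map[String, (Long, Long)] of
    // block-manager address -> (max memory for caching, remaining free memory).
    def printCacheUsage(sc: SparkContext): Unit = {
      sc.getExecutorMemoryStatus.foreach { case (executor, (maxMem, freeMem)) =>
        val usedMem = maxMem - freeMem
        val pct = if (maxMem > 0) 100.0 * usedMem / maxMem else 0.0
        println(f"$executor: cache $pct%.0f%% used " +
          f"(${usedMem >> 30}%d/${maxMem >> 30}%d GB)")
      }
    }

    // Call this every few seconds, e.g. from a background thread on the driver.
    printCacheUsage(sc)

This only approximates the "Cache" line of the desired output; per-block numbers for shuffle and heap generally have to come from the executor metrics in the UI/REST API or from JVM-level monitoring.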
I know that I can find the sweet spots by experimenting with other methods, but that is very laborious; simply being able to follow the usage directly would greatly simplify writing and tuning Spark jobs.


Source: https://habr.com/ru/post/1544363/

