Spark executors use part of their memory for caching (spark.storage.memoryFraction) and part for shuffles (spark.shuffle.memoryFraction). The rest is available to application code, for example code that runs inside an RDD.map operation.
I would like to know how much of this working memory there is. (I want partitions that are as large as possible while still fitting in memory, so I would divide the total data size by the working memory available to each task to get the number of partitions.)
This is how I calculate it:
val numExecutors = sc.getExecutorStorageStatus.size - 1            // exclude the driver
val totalCores = numExecutors * numCoresPerExecutor                // numCoresPerExecutor is known from my cluster setup
val cacheMemory = sc.getExecutorMemoryStatus.values.map(_._1).sum  // max memory available for storage
val conf = sc.getConf
val cacheFraction = conf.getDouble("spark.storage.memoryFraction", 0.6)
val shuffleFraction = conf.getDouble("spark.shuffle.memoryFraction", 0.2)
val workFraction = 1.0 - cacheFraction - shuffleFraction           // fraction left for application code
val workMemory = workFraction * cacheMemory / cacheFraction        // scale storage memory back up to the whole heap, then take the work fraction
val workMemoryPerCore = workMemory / totalCores
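For context, this is roughly how I intend to use the result. It is only a sketch: dataSizeBytes stands for however I estimate the input size, and rdd for the RDD I want to repartition; neither is defined above.

// Pick a partition count so that each partition fits in the per-core working memory.
def choosePartitions(dataSizeBytes: Long): Int =
  math.max(1, math.ceil(dataSizeBytes / workMemoryPerCore).toInt)

// e.g. val repartitioned = rdd.repartition(choosePartitions(dataSizeBytes))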
I am sure you will agree that this is terrible. Worst of all, if the defaults change in a future Spark version, my result will be wrong. The default values are hardcoded in Spark, and I have no way to get at them.
Is there a better way to get workMemoryPerCore?