How long does RDD remain in memory?

Given that memory is limited, I have the feeling that Spark automatically removes RDDs from each node. I would like to know if this time is configurable. How does Spark decide when to evict an RDD from memory?

Note: I'm not talking about rdd.cache().

+6
4 answers

I would like to know if this time is configurable? How does Spark decide when to evict an RDD from memory?

An RDD is an object like any other. If you don't cache/persist it, it acts like any other object in a managed language: it becomes eligible for collection once there are no live references to it from GC roots.
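
To make that concrete, here is a minimal sketch (my own illustration, not Spark source code; it assumes a local[2] master and the standard RDD API) of an RDD reference going out of scope and the JVM later collecting it:

import org.apache.spark.{SparkConf, SparkContext}

object RddLifetimeSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("rdd-lifetime").setMaster("local[2]"))

    def makeAndCount(): Long = {
      // The RDD wrapper object is only reachable inside this method.
      val rdd = sc.parallelize(1 to 1000000)
      rdd.count()
    } // after this returns, no live reference to `rdd` remains

    println(makeAndCount())

    // Suggest a JVM GC: once the wrapper is collected, the weak reference
    // that ContextCleaner registered for it (see below) is enqueued, and the
    // RDD's driver-side metadata is cleaned up asynchronously.
    System.gc()
    Thread.sleep(5000)
    sc.stop()
  }
}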

"", @Jacek, ContextCleaner. , , :

private def keepCleaning(): Unit = Utils.tryOrStopSparkContext(sc) {
  while (!stopped) {
    try {
      val reference = Option(referenceQueue.remove(ContextCleaner.REF_QUEUE_POLL_TIMEOUT))
        .map(_.asInstanceOf[CleanupTaskWeakReference])
      // Synchronize here to avoid being interrupted on stop()
      synchronized {
        reference.foreach { ref =>
          logDebug("Got cleaning task " + ref.task)
          referenceBuffer.remove(ref)
          ref.task match {
            case CleanRDD(rddId) =>
              doCleanupRDD(rddId, blocking = blockOnCleanupTasks)
            case CleanShuffle(shuffleId) =>
              doCleanupShuffle(shuffleId, blocking = blockOnShuffleCleanupTasks)
            case CleanBroadcast(broadcastId) =>
              doCleanupBroadcast(broadcastId, blocking = blockOnCleanupTasks)
            case CleanAccum(accId) =>
              doCleanupAccum(accId, blocking = blockOnCleanupTasks)
            case CleanCheckpoint(rddId) =>
              doCleanCheckpoint(rddId)
          }
        }
      }
    } catch {
      case ie: InterruptedException if stopped => // ignore
      case e: Exception => logError("Error in cleaning thread", e)
    }
  }
}

If you want to learn more about how Spark manages these pieces of state, @Jacek's book "Mastering Apache Spark" is a good read (it has a dedicated page on ContextCleaner).

+7

, , " ", ... ( "", ?)

In Spark it's not that obvious, since we have shuffle blocks (among the other blocks managed by Spark). They are managed by the BlockManagers running on executors, which somehow have to be notified when an object on the driver gets evicted from memory, right?

That's where ContextCleaner comes in. It is the Spark application's garbage collector: it is responsible for application-wide cleanup of shuffles, RDDs, broadcasts, accumulators, and checkpointed RDDs, with the aim of reducing the memory requirements of long-running, data-heavy Spark applications.

ContextCleaner runs on the driver. It is created and immediately started when SparkContext starts (provided the spark.cleaner.referenceTracking Spark property is enabled, which it is by default), and it is stopped when SparkContext is stopped.
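
A short sketch of the related configuration knobs (the property names are real Spark settings; the values shown are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.cleaner.referenceTracking", "true")           // enable ContextCleaner (default: true)
  .set("spark.cleaner.referenceTracking.blocking", "true")  // block the cleaning thread on cleanup tasks (default: true)
  .set("spark.cleaner.periodicGC.interval", "30min")        // how often the driver requests a full JVM GC (default: 30min)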

You can see it at work by dumping all the threads of a Spark application using jconsole or jstack: ContextCleaner uses a daemon thread named Spark Context Cleaner that cleans up RDD, shuffle, and broadcast state.

You can also watch it work by enabling the INFO or DEBUG logging level for org.apache.spark.ContextCleaner. Just add the following line to conf/log4j.properties:

log4j.logger.org.apache.spark.ContextCleaner=DEBUG
+5

Measuring the impact of GC

The first step in GC tuning is to collect statistics on how often garbage collection occurs and how much time is spent on GC. This can be done by adding -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps to the Java options. (See the configuration guide for information on passing Java options to Spark jobs.) The next time your Spark job runs, you will see a message printed in the worker's logs each time a garbage collection occurs. Note that these logs will be on your cluster's worker nodes (in the stdout files in their work directories), not in your driver program.
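
For example, one way to attach those flags to the executors is through spark.executor.extraJavaOptions (a sketch; the same flags can equally be passed via spark-submit --conf or spark-defaults.conf):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions",
       "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")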

Advanced GC tuning

To further tune garbage collection, we first need to understand some basics of memory management in the JVM:

Java heap space is divided into two regions, Young and Old. The Young generation holds short-lived objects, while the Old generation is intended for objects with longer lifetimes.

The Young generation is further divided into three regions [Eden, Survivor1, Survivor2].

A simplified description of the garbage collection procedure: when Eden is full, a minor GC is run on Eden, and objects that are still alive are copied from Eden and Survivor1 into Survivor2. The Survivor regions are then swapped. If an object is old enough, or if Survivor2 is full, it is moved to Old. Finally, when Old is close to full, a full GC is invoked.
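
If you want to experiment with these regions, their sizes can be adjusted through the usual HotSpot flags, passed to executors the same way as above (a sketch with illustrative values only):

import org.apache.spark.SparkConf

// -Xmn sizes the Young generation; -XX:SurvivorRatio controls the
// Eden-to-Survivor sizing within it.
val conf = new SparkConf()
  .set("spark.executor.extraJavaOptions", "-Xmn2g -XX:SurvivorRatio=8")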

+1

From the Resilient Distributed Datasets paper:

Our worker nodes cache RDD partitions in memory as Java objects. We use an LRU replacement policy at the level of RDDs (i.e., we do not evict partitions from an RDD in order to load other partitions from the same RDD), because most operations are scans. We found this simple policy to work well in all our user applications so far. Programmers who want more control can also set a retention priority for each RDD as an argument to cache.
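
For illustration, the user-facing side of this looks roughly like the following (a sketch assuming an existing SparkContext sc; StorageLevel is the closest modern analogue of the paper's per-RDD priority):

import org.apache.spark.storage.StorageLevel

val data = sc.parallelize(1 to 1000000).map(_ * 2)
data.persist(StorageLevel.MEMORY_ONLY)  // cached partitions become subject to LRU eviction
data.count()                            // materializes and caches the partitions
data.unpersist()                        // explicitly frees the cached blocks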

+1

Source: https://habr.com/ru/post/1673704/
