What happens when intermediate output doesn't fit in RAM in Spark

I just started learning Spark. In my understanding, Spark stores intermediate output in RAM, so it is very fast compared to Hadoop. Correct me if I am wrong.

My question is: if my intermediate output is 2 GB and my free memory is 1 GB, what happens in this case? This may be a dumb question, but I have not been able to get Spark's in-memory concept clear in my mind. Can someone explain Spark's in-memory concept to me?

thanks

+5
2 answers

This question touches on RDD persistence in Spark.

You can mark an RDD you want to keep around using the persist() or cache() methods. The first time it is computed in an action, it will be kept in memory on the nodes. Spark's cache is fault tolerant: if any partition of an RDD is lost, it will automatically be recomputed using the transformations that originally created it.

Depending on how you set the storage level for the RDD, you can get different results. For example, if you set the storage level to MEMORY_ONLY (which is the default), Spark stores as many partitions as fit in memory and recomputes the rest of the RDD on the fly whenever it is needed. You can persist your RDD and apply this storage level as follows: rdd.persist(StorageLevel.MEMORY_ONLY).
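For illustration, here is a minimal Scala sketch of persisting an RDD at MEMORY_ONLY; the input path and application name are placeholders I made up, not anything from your environment:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("persist-example").setMaster("local[*]"))

// Build an RDD from a (hypothetical) input file.
val lines = sc.textFile("hdfs:///data/input.txt")
val words = lines.flatMap(_.split(" "))

// MEMORY_ONLY is the default level used by cache(): partitions that do not fit
// in memory are simply not stored and are recomputed whenever they are needed.
words.persist(StorageLevel.MEMORY_ONLY)

words.count()             // first action: computes the RDD and fills the cache
words.distinct().count()  // later action: reuses cached partitions, recomputes the rest
```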

In your example, 1 GB of your output will be computed and kept in memory, and the other 1 GB will be recomputed when a later step needs it. There are other storage levels that can be set depending on your use case (a short sketch follows the list):

  • MEMORY_AND_DISK - compute the entire RDD, but spill the partitions that do not fit in memory to disk
  • MEMORY_ONLY_SER, MEMORY_AND_DISK_SER - the same as above, but the elements are stored in serialized form
  • DISK_ONLY - store all partitions directly on disk
  • MEMORY_ONLY_2, MEMORY_AND_DISK_2 - the same as above, but partitions are replicated on two nodes for fault tolerance
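To make the 2 GB / 1 GB scenario concrete, here is a hedged sketch (again with a made-up path) of choosing a level that spills to disk instead of recomputing:

```scala
import org.apache.spark.storage.StorageLevel

// Assuming `sc` is an existing SparkContext (for example, in spark-shell).
val bigRdd = sc.textFile("hdfs:///data/2gb-intermediate-output")  // hypothetical path

// MEMORY_AND_DISK: keep whatever fits in memory (roughly 1 GB in your example)
// and spill the remaining partitions to local disk instead of recomputing them.
bigRdd.persist(StorageLevel.MEMORY_AND_DISK)
bigRdd.count()  // the first action materializes the cache

// DISK_ONLY would skip memory entirely and write every partition to disk:
// bigRdd.persist(StorageLevel.DISK_ONLY)
```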

Again, you should study your use case to figure out which storage level is best. In some cases, recomputing an RDD can actually be faster than reading everything back from disk. In other cases, a fast serializer reduces the amount of data that has to be read from disk, which can make serialized storage the quicker option.
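If serialization is the route you take, one common combination (not something this answer prescribes, just a sketch under my own assumptions about names and paths) is a serialized storage level together with the Kryo serializer:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Kryo is generally faster and more compact than the default Java serialization.
val conf = new SparkConf()
  .setAppName("serialized-cache-example")        // placeholder name
  .setMaster("local[*]")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)

// MEMORY_ONLY_SER keeps each partition as a serialized byte array: more compact
// in memory, at the cost of deserializing the data every time it is read.
val rdd = sc.textFile("hdfs:///data/input.txt")  // hypothetical path
rdd.persist(StorageLevel.MEMORY_ONLY_SER)
rdd.count()
```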

+6

If I understand your question correctly, I can answer as follows:

The intermediate or temporary storage directory is specified by the spark.local.dir configuration property when you configure the Spark context.

spark.local.dir is the directory that Spark uses for "scratch" space, including map output files and RDDs that get stored on disk. [Ref: Spark configuration]

It should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
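A minimal sketch of setting it, assuming two locally mounted SSDs (the paths and application name below are invented for the example):

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("local-dir-example")  // placeholder name
  .setMaster("local[*]")
  // Comma-separated list of fast local disks used for Spark's scratch space
  // (map output files and RDD partitions that are stored on disk).
  .set("spark.local.dir", "/mnt/ssd1/spark-tmp,/mnt/ssd2/spark-tmp")

val sc = new SparkContext(conf)
```

Note that on some cluster managers (e.g. YARN) this setting can be overridden by the cluster's own local-directory configuration, so check the configuration docs for your deployment.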

However, what you are really asking about here is called RDD persistence. Beyond basic caching, which you should already know about when using Spark, there is also what is called the RDD storage level, which lets each persisted dataset use a different storage strategy.

This allows you, for example, to persist the dataset on disk, keep it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon (this last one is experimental). More info here.

Note: these levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is shorthand for the default storage level, which is StorageLevel.MEMORY_ONLY, where Spark stores deserialized objects in memory.
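A quick sketch of the difference (assuming an existing `sc`, e.g. from spark-shell):

```scala
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)

// cache() is just shorthand for persist(StorageLevel.MEMORY_ONLY).
rdd.cache()
println(rdd.getStorageLevel)  // shows the level currently in effect

// For any other level, pass a StorageLevel to persist(). The level of an
// already-persisted RDD cannot be changed, so unpersist first.
rdd.unpersist()
rdd.persist(StorageLevel.MEMORY_AND_DISK_SER)
```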

So, to answer your question now,

if my intermediate output is 2 GB and my free memory is 1 GB, then what happens in this case?

I would say that it depends on how you configure and tune Spark (the application and the cluster).

Note: Spark's in-memory processing is, in a broad conceptual sense, like any other in-memory approach; the main goal is to avoid heavy and expensive I/O. Coming back to your question, it also means that if you decide to persist to DISK, you will lose some performance. There is more about this in the official documentation referenced in this answer.

+2

Source: https://habr.com/ru/post/1233948/
