If I understand your question correctly, I can answer as follows:
The intermediate or temporary storage directory is specified by the spark.local.dir configuration parameter when you configure the Spark context.
The spark.local.dir directory is the space Spark uses for "scratch" files, including map output files and RDDs that get stored on disk. [Ref.: Spark configuration]
This should be on a fast, local disk in your system. It can also be a comma-separated list of multiple directories on different disks.
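For illustration, here is a minimal sketch of setting spark.local.dir when building the context; the application name and directory paths are hypothetical and should point at fast local disks on your machines:

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical paths; a comma-separated list spreads scratch IO across disks.
val conf = new SparkConf()
  .setAppName("LocalDirExample")
  .setMaster("local[*]")
  .set("spark.local.dir", "/mnt/disk1/spark-scratch,/mnt/disk2/spark-scratch")

val sc = new SparkContext(conf)
```

Note that on a cluster this setting may be overridden by environment variables set by the cluster manager (SPARK_LOCAL_DIRS on Standalone/Mesos, LOCAL_DIRS on YARN), as described in the configuration docs.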
However, what you are really asking about here is RDD persistence. Beyond the basics of Spark caching, you should also know about RDD storage levels, which let you choose how and where a persisted RDD is stored.
These let you, for example, persist the dataset on disk, keep it in memory but as serialized Java objects (to save space), replicate it across nodes, or store it off-heap in Tachyon (this last one is experimental). More info here.
Note: these levels are set by passing a StorageLevel object (Scala, Java, Python) to persist(). The cache() method is shorthand for the default storage level, StorageLevel.MEMORY_ONLY, in which Spark stores deserialized objects in memory.
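As a sketch (assuming an existing SparkContext named sc; the input path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

// Assumes an existing SparkContext `sc`; the HDFS path is hypothetical.
val lines = sc.textFile("hdfs:///data/events.log")

// Explicit level: keep what fits in memory, spill the rest to disk.
lines.persist(StorageLevel.MEMORY_AND_DISK)

// Equivalent shorthand for the default level, StorageLevel.MEMORY_ONLY:
// lines.cache()  // commented out: an RDD can only be assigned one storage level
```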
So, to answer your question now:
"if my intermediate output is 2 GB and my free memory is 1 GB, then what happens in this case?"
I would say that it depends on how you configure Spark (both the application and the cluster). Under the default MEMORY_ONLY level, partitions that do not fit in memory are simply not cached and are recomputed on the fly each time they are needed; under MEMORY_AND_DISK, partitions that do not fit are spilled to disk (under spark.local.dir) and read back from there, as shown in the sketch below.
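To make the 2 GB / 1 GB case concrete, here is a hedged sketch (the input path and transformation are placeholders):

```scala
import org.apache.spark.storage.StorageLevel

// With MEMORY_AND_DISK, the partitions that fit stay in memory and the rest
// are written to disk under spark.local.dir instead of being recomputed on
// every reuse, as they would be under the default MEMORY_ONLY level.
val intermediate = sc.textFile("hdfs:///data/big-input")
  .map(_.toUpperCase)                     // stand-in for a costly transformation
  .persist(StorageLevel.MEMORY_AND_DISK)

intermediate.count() // the first action materializes and caches the partitions
```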
Note: Spark's in-memory caching is similar to any other cache in a broad conceptual sense; its main goal is to avoid heavy and expensive IO. Coming back to your question, this also means that if you decide to persist to DISK, you will lose some performance. More on this in the official documentation referenced in this answer.