Where does Spark store data when the storage level is set to disk?

I was wondering which directory Spark stores data in when the storage level is set to DISK_ONLY or MEMORY_AND_DISK (for the partitions that do not fit into memory). I ask because it does not seem to matter which level I set: if a program crashes with MEMORY_ONLY, it also crashes with all the other levels.
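For reference, I persist roughly like this (a minimal sketch; the input path is a placeholder):

```scala
import org.apache.spark.storage.StorageLevel

// Hypothetical RDD; the HDFS path is a placeholder.
val rdd = sc.textFile("hdfs:///some/input")

// DISK_ONLY writes all cached partitions to Spark's local directories.
rdd.persist(StorageLevel.DISK_ONLY)

// MEMORY_AND_DISK keeps what fits in memory and spills the rest:
// rdd.persist(StorageLevel.MEMORY_AND_DISK)

rdd.count()  // materializes the RDD so the blocks are actually stored
```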

In the cluster that I use, the /tmp directory is a RAM disk and therefore limited in size. Is Spark trying to store disk-level data on this disk? Maybe that is why I do not see a difference between the levels. If so, how can I change this default behavior? If I use the YARN cluster that comes with Hadoop, do I need to change the /tmp folder in the Hadoop configuration files, or is it enough to change spark.local.dir in Spark?

1 answer

Yes, Spark will try to store disk-level data on that disk: by default its local directory is /tmp.

However, when you run Spark on YARN, Spark uses the local directories configured for YARN (the Hadoop YARN setting yarn.nodemanager.local-dirs). If you set spark.local.dir, it will be ignored.
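To be explicit about what gets ignored: in standalone or local mode you could set spark.local.dir yourself, something like the sketch below (the directory is a placeholder; under YARN this setting has no effect):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Only honored outside YARN; under YARN the node manager's
// local dirs win. The directory below is a placeholder.
val conf = new SparkConf()
  .setAppName("disk-spill-example")
  .set("spark.local.dir", "/data/spark-tmp")
val sc = new SparkContext(conf)
```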

See: https://spark.apache.org/docs/latest/running-on-yarn.html#important-notes

So, if you want to change where disk-level data is stored on YARN, change yarn.nodemanager.local-dirs in the YARN configuration, as sketched below.
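Concretely, that means editing yarn-site.xml on the node managers, along these lines (the directories are placeholders; point them at real disks, not the RAM-backed /tmp, and restart the node managers afterwards):

```xml
<!-- yarn-site.xml: comma-separated list of local directories -->
<property>
  <name>yarn.nodemanager.local-dirs</name>
  <value>/data1/yarn/local,/data2/yarn/local</value>
</property>
```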

