Equivalent of Hadoop's distributed cache in Spark?

In Hadoop, you can use the distributed cache to copy read-only files to each node. What is the equivalent way to do this in Spark? I know about broadcast variables, but those are only useful for variables, not files.

2 answers

Take a look at SparkContext.addFile().

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported file systems), or an HTTP, HTTPS, or FTP URI. To access the file in Spark jobs, use SparkFiles.get(file_name) to find its download location.

If the recursive parameter is set to true, a directory may be specified. Directories are currently only supported for Hadoop-supported file systems.
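A minimal PySpark sketch of this pattern; the file path, app name, and the lookup logic are placeholders for illustration only:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="addFileExample")

    # Ship a read-only file to every node ("/path/to/config.txt" is a hypothetical local path).
    sc.addFile("/path/to/config.txt")

    def is_allowed(value):
        # On the executor, SparkFiles.get() resolves where the file was downloaded to.
        with open(SparkFiles.get("config.txt")) as f:
            allowed = set(line.strip() for line in f)
        return value in allowed

    result = sc.parallelize(["a", "b", "c"]).filter(is_allowed).collect()

The key point is that the driver calls addFile() once, and every task can then open the local copy via SparkFiles.get() without reading it from a shared file system.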


If your files are text files living in HDFS, then you can use:

the textFile("<hdfs-path>") method of SparkContext.

This call gives you an RDD, which you can keep on the nodes using that RDD's persist() method.

persist() can store the file data (serialized or deserialized) in memory and/or on disk.

For guidance on which storage level to choose, see:

http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
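A short sketch of this approach; the HDFS URI and app name are placeholders, and MEMORY_AND_DISK is just one possible storage level:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persistExample")

    # "hdfs://namenode:8020/data/lookup.txt" stands in for your actual <hdfs-path>.
    lines = sc.textFile("hdfs://namenode:8020/data/lookup.txt")

    # Keep the data on the nodes, spilling to disk if it does not fit in memory.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    print(lines.count())

Note that, unlike the distributed cache, this gives you an RDD partitioned across the cluster rather than a full local copy of the file on every node.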


Source: https://habr.com/ru/post/989703/

