Equivalent of Hadoop's distributed cache in Spark?

In Hadoop, you can use the distributed cache to copy read-only files to each node. What is the equivalent way to do this in Spark? I know about broadcast variables, but those are only useful for variables, not files.

2 answers

Take a look at SparkContext.addFile().

Add a file to be downloaded with this Spark job on every node. The path passed can be either a local file, a file in HDFS (or other Hadoop-supported file systems), or an HTTP, HTTPS, or FTP URI. To access the file in Spark jobs, use SparkFiles.get(file_name) to find its download location.

If the recursive parameter is set to true, a directory may be specified. Directories are currently only supported for Hadoop-supported file systems.
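A minimal PySpark sketch of this pattern; the file path, app name, and the lookup logic are placeholders for illustration only:

    from pyspark import SparkContext, SparkFiles

    sc = SparkContext(appName="addFileExample")

    # Ship a read-only file to every node ("/path/to/config.txt" is a hypothetical local path).
    sc.addFile("/path/to/config.txt")

    def is_allowed(value):
        # On the executor, SparkFiles.get() resolves where the file was downloaded to.
        with open(SparkFiles.get("config.txt")) as f:
            allowed = set(line.strip() for line in f)
        return value in allowed

    result = sc.parallelize(["a", "b", "c"]).filter(is_allowed).collect()

The key point is that the driver calls addFile() once, and every task can then open the local copy via SparkFiles.get() without reading it from a shared file system.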


If your files are text files living in HDFS, then you can use:

the textFile("<hdfs-path>") method of SparkContext.

This call gives you an RDD, which you can keep on the nodes using that RDD's persist() method.

persist() can store the file data (serialized or deserialized) in memory and/or on disk.

For guidance on which storage level to choose, see:

http://spark.apache.org/docs/latest/programming-guide.html#which-storage-level-to-choose
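A short sketch of this approach; the HDFS URI and app name are placeholders, and MEMORY_AND_DISK is just one possible storage level:

    from pyspark import SparkContext, StorageLevel

    sc = SparkContext(appName="persistExample")

    # "hdfs://namenode:8020/data/lookup.txt" stands in for your actual <hdfs-path>.
    lines = sc.textFile("hdfs://namenode:8020/data/lookup.txt")

    # Keep the data on the nodes, spilling to disk if it does not fit in memory.
    lines.persist(StorageLevel.MEMORY_AND_DISK)

    print(lines.count())

Note that, unlike the distributed cache, this gives you an RDD partitioned across the cluster rather than a full local copy of the file on every node.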


Source: https://habr.com/ru/post/989703/

