Hadoop DistributedCache functionality in Spark

I am looking for functionality in Spark similar to Hadoop's DistributedCache. I need a relatively small data file (containing some index values) to be present on all nodes in order to do some calculations. Is there any approach that makes this possible in Spark?

My workaround so far is to distribute and reduce the index file as normal processing, which takes about 10 seconds in my application. After that, I collect the file on the driver and register it as a broadcast variable, as follows:

JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt", 1);
// collect() returns a List<String>; copy it into an ArrayList before broadcasting
ArrayList<String> localIndex = new ArrayList<String>(indexFile.collect());
final Broadcast<ArrayList<String>> globalIndex = ctx.broadcast(localIndex);

This lets every task see what the globalIndex variable contains. So far this workaround is acceptable for me, but I believe it is not the best solution. Would it still be effective with a significantly larger dataset or a large number of broadcast variables?
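For illustration, this is roughly how my tasks read the broadcast index (the data file name and the comma-separated line layout here are just placeholders, not the real format):

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Workers read the broadcast index via value() instead of receiving
// a copy of it with every task.
JavaRDD<String> data = ctx.textFile("s3n://mybucket/input/data.txt");
JavaRDD<String> matched = data.filter(new Function<String, Boolean>() {
    @Override
    public Boolean call(String line) {
        // keep lines whose first field appears in the broadcast index
        return globalIndex.value().contains(line.split(",")[0]);
    }
});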

Note: I am using Spark 1.0.0, running a standalone cluster across multiple EC2 instances.

2 answers

Refer to the SparkContext.addFile() method. I think this is what you were looking for.
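For example, a minimal sketch of how it could be used (someRdd, the bucket path, and the task body are placeholders, not taken from your code):

import org.apache.spark.SparkFiles;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Ship the index file to every node once; tasks then resolve the
// locally cached copy through SparkFiles.get().
ctx.addFile("s3n://mybucket/input/indexFile.txt");

JavaRDD<String> processed = someRdd.map(new Function<String, String>() {
    @Override
    public String call(String line) throws Exception {
        // absolute path of the cached copy on the node running this task
        String localIndexPath = SparkFiles.get("indexFile.txt");
        // ... open localIndexPath, load the index, use it for the calculation ...
        return line;
    }
});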


As long as you use broadcast variables, this should remain efficient even with a larger dataset.

From the Spark documentation, "Broadcast variables allow a programmer to save a read-only cached variable on each machine, rather than sending copies of it with tasks. They can be used, for example, to give each node a copy of a large input dataset in an efficient way."
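As a rough sketch, if the index grows you could broadcast a HashMap instead of an ArrayList so that lookups on the workers stay cheap (the "key,value" line layout below is only an assumption about your index file):

import java.util.HashMap;
import org.apache.spark.broadcast.Broadcast;

// Build a map-shaped index on the driver before broadcasting it;
// a "key,value" layout per line is assumed here for illustration.
HashMap<String, String> index = new HashMap<String, String>();
for (String line : indexFile.collect()) {
    String[] parts = line.split(",");
    index.put(parts[0], parts[1]);
}
final Broadcast<HashMap<String, String>> bIndex = ctx.broadcast(index);
// Workers then call bIndex.value().get(key) for constant-time lookups.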



