I am looking for functionality in Spark similar to Hadoop's distributed cache. I need a relatively small data file (containing some index values) to be present on every node so that tasks can use it in their calculations. Is there an approach that makes this possible in Spark?
My workaround so far is to read and collect the index file as a normal Spark job, which takes about 10 seconds in my application. After that, I ship it to all nodes as a broadcast variable, as follows:
JavaRDD<String> indexFile = ctx.textFile("s3n://mybucket/input/indexFile.txt", 1);
// Copy into a new ArrayList rather than casting, since collect() only guarantees a List
ArrayList<String> localIndex = new ArrayList<String>(indexFile.collect());
final Broadcast<ArrayList<String>> globalIndex = ctx.broadcast(localIndex);
This way, the worker tasks can read the contents of globalIndex. So far this patch may be acceptable for my case, but I believe it is not the best solution. Would it still be effective with a significantly larger data set or a large number of broadcast variables?
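For context, this is a minimal sketch of how my tasks consume the broadcast value; dataFile and the matching logic are placeholders for my actual computation:

import java.util.ArrayList;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.Function;

// Hypothetical usage: tag each record of dataFile against the broadcast index.
JavaRDD<String> tagged = dataFile.map(new Function<String, String>() {
    @Override
    public String call(String record) throws Exception {
        // value() returns the node-local copy of the list,
        // so the index is not reshipped with every task.
        ArrayList<String> index = globalIndex.value();
        return index.contains(record) ? record + "\tHIT" : record + "\tMISS";
    }
});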
Note: I am using Spark 1.0.0 on a standalone cluster spread across multiple EC2 instances.