Spark in Java - What is the right way to have a static object on all workers

Question

I need to use a non-serializable third-party class in my functions on all executors in Spark, for example:

JavaRDD<String> resRdd = origRdd
    .flatMap(new FlatMapFunction<String, String>() {
        @Override
        public Iterable<String> call(String t) throws Exception {
            // A DynamoDB mapper I don't want to initialise every time
            DynamoDBMapper mapper = new DynamoDBMapper(new AmazonDynamoDBClient(credentials));
            Set<String> userFav = mapper.load(userDataDocument.class, userId).getFav();
            return userFav;
        }
    });

I would like to have a static DynamoDBMapper mapper, which I initialize once per executor and can then reuse again and again.

Since it is not serializable, I cannot initialize it once on the driver and broadcast it.

Note: there is an answer to this here (What is the right way to have a static object on all workers), but it is only for Scala.

java static apache-spark

Roee gavirel Jan 26 '16 at 15:51

1 answer

Alex naspo · Accepted Answer · 2016-01-27T15:10:01+0000

You can use mapPartitions or foreachPartition. Here is a snippet taken from Learning Spark:

Using partition-based operations, we can share a connection pool to this database, to avoid setting up many connections, and reuse our JSON parser. As Examples 6-10 through 6-12 show, we use mapPartitions(), which gives us an iterator of the elements in each partition of the input RDD and expects us to return an iterator of our results.

This lets us initialize one connection per partition on each executor (rather than one per element) and then iterate over the elements in the partition however we like. This is very useful for saving data to an external database or for creating an expensive reusable object.

Here is a simple Scala example taken from a related book. It can be translated into Java if necessary; it is here only to illustrate the per-partition pattern used by mapPartitions and foreachPartition.

ipAddressRequestCount.foreachRDD { rdd =>
  rdd.foreachPartition { partition =>
    // Open connection to storage system (e.g. a database connection)
    partition.foreach { item =>
      // Use connection to push item to system
    }
    // Close connection
  }
}
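A rough Java translation of the same per-partition idea might look like the sketch below. It is deliberately Spark-free so it can be run on its own: ExpensiveMapper is a hypothetical stand-in for a non-serializable client such as DynamoDBMapper, and PartitionLookup.lookupFavourites is an assumed helper name, not part of any library. The point it illustrates is that the expensive object is built once per call (that is, once per partition), not once per element.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-in for an expensive, non-serializable client
// (for example a DynamoDBMapper). Not a real library class.
class ExpensiveMapper {
    // Counts constructions so the "once per partition" claim can be checked.
    static int instancesCreated = 0;

    ExpensiveMapper() {
        instancesCreated++;
    }

    // Pretend lookup: returns one made-up favourite per user id.
    List<String> load(String userId) {
        List<String> favs = new ArrayList<>();
        favs.add(userId + "-fav");
        return favs;
    }
}

class PartitionLookup {
    // Intended as the body of a per-partition function: it receives all
    // elements of one partition as an iterator and returns an iterator of results.
    static Iterator<String> lookupFavourites(Iterator<String> userIds) {
        // Built once for the whole partition, on the executor that runs it,
        // so it never has to be serialized.
        ExpensiveMapper mapper = new ExpensiveMapper();

        // For simplicity the results are buffered in a list, which holds the
        // whole partition's output in memory.
        List<String> out = new ArrayList<>();
        while (userIds.hasNext()) {
            out.addAll(mapper.load(userIds.next()));
        }
        return out.iterator();
    }
}
```

Assuming Spark 2.x or later, where the function passed to mapPartitions returns an Iterator, this could presumably be wired in as origRdd.mapPartitions(PartitionLookup::lookupFavourites). On Spark 1.x the function returns an Iterable instead, so the helper would return the list itself rather than its iterator.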

Here is a link to a java example.
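If the goal is literally one instance per executor JVM rather than one per partition, a lazily initialized static holder is another option worth considering. The sketch below is plain Java and makes some assumptions: ExpensiveClient is a hypothetical stand-in for the non-serializable class, and ClientHolder is an illustrative name. Static fields are not shipped with a serialized closure, so each JVM that calls ClientHolder.get() builds its own instance on first use.

```java
// Hypothetical stand-in for the non-serializable third-party class.
class ExpensiveClient {
    // Counts constructions, for illustration only.
    static int instancesCreated = 0;

    ExpensiveClient() {
        instancesCreated++;
    }

    String lookup(String key) {
        return key + "-value";
    }
}

final class ClientHolder {
    private ClientHolder() {
    }

    // Initialization-on-demand holder idiom: the JVM initializes the nested
    // class Lazy (thread-safely) the first time get() touches it.
    private static final class Lazy {
        static final ExpensiveClient INSTANCE = new ExpensiveClient();
    }

    // Returns the single instance for this JVM, creating it on first call.
    static ExpensiveClient get() {
        return Lazy.INSTANCE;
    }
}
```

Inside a Spark function, the call would then be something like ClientHolder.get().lookup(key), with nothing non-serializable captured by the closure. In a real job the holder would also need to obtain its credentials or configuration on the executor side, which this sketch leaves out.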
