I have an RDD with 5 partitions and 5 workers/executors. How can I ask Spark to store each partition of the RDD on a different worker (IP)?
Am I right that Spark can store several partitions on one worker and zero partitions on another? That is, I can specify the number of partitions, but Spark may still cache everything on a single node.
Replication is not an option, since the RDD is huge.
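To make the concern concrete, a minimal sketch of what I mean (paths and values are placeholders): I can choose the partition count and cache the data, but nothing here forces each cached partition onto a different worker.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

val sc = new SparkContext(new SparkConf().setAppName("partition-placement"))

// I can control how many partitions the RDD has...
val sales = sc.textFile("hdfs:///sales/", minPartitions = 5).repartition(5)

// ...and cache them, but Spark may still place several of the 5 cached
// partitions on the same worker and none on another.
sales.persist(StorageLevel.MEMORY_AND_DISK)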
Workarounds I found
getPreferredLocations
Overriding the RDD method getPreferredLocations does not provide a 100% guarantee that a partition will be stored on the specified node. Spark will try for the duration of spark.locality.wait, but after that it will cache the partition on a different node.
As a workaround, you can set spark.locality.wait to a very high value and override getPreferredLocations. The bad news is that you cannot do this from Java; you have to write Scala code, or at least Scala internals wrapped in Java code. For example:
import scala.reflect.ClassTag
import org.apache.spark.{Partition, TaskContext}
import org.apache.spark.rdd.RDD

class NodeAffinityRDD[U: ClassTag](prev: RDD[U]) extends RDD[U](prev) {

  val nodeIPs = Array("192.168.2.140", "192.168.2.157", "192.168.2.77")

  // RDD's abstract members must also be implemented; here they simply
  // delegate data and partitioning to the parent RDD.
  override protected def getPartitions: Array[Partition] = firstParent[U].partitions

  override def compute(split: Partition, context: TaskContext): Iterator[U] =
    firstParent[U].iterator(split, context)

  // Pin each partition to one of the nodes, round-robin over nodeIPs.
  override def getPreferredLocations(split: Partition): Seq[String] =
    Seq(nodeIPs(split.index % nodeIPs.length))
}
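For completeness, this is roughly how I would raise the locality wait when building the context (the value below is an arbitrary example; older Spark versions expect plain milliseconds rather than a time suffix):

import org.apache.spark.{SparkConf, SparkContext}

// Make Spark wait much longer for a node-local slot before it gives up
// and runs the task (and caches the partition) on another node.
val conf = new SparkConf()
  .setAppName("node-affinity")
  .set("spark.locality.wait", "30m")

val sc = new SparkContext(conf)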
SparkContext makeRDD
SparkContext has a makeRDD method. The method is barely documented, but as far as I understand, I can specify preferred locations there and then set spark.locality.wait to a high value. The bad news is that the preferred locations are dropped by the first shuffle / join / cogroup operation.
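As far as I can tell, the usage would look roughly like this (reusing sc from the sketch above; file names and IPs are placeholders):

// Each element becomes its own partition, with the given hosts as its
// preferred locations.
val filesWithLocations = Seq(
  ("sales-001.parquet", Seq("192.168.2.140")),
  ("sales-002.parquet", Seq("192.168.2.157")),
  ("sales-003.parquet", Seq("192.168.2.77"))
)

val fileRDD = sc.makeRDD(filesWithLocations)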
Both approaches share the downside that spark.locality.wait has to be set very high, which can leave your cluster starved for resources if some of the nodes become unavailable.
PS More context
I have up to 10,000 sales-XXX.parquet files, each representing sales of different goods in different regions. An individual sales-XXX.parquet can range from a few KB to a few GB, and together they can take up tens or hundreds of GB on HDFS. I need full-text search over all sales, so I have to index each sales-XXX.parquet one by one with Lucene. Now I have two options:
- Keep the Lucene indexes in Spark. There is already a solution for this, but it looks rather suspicious. Are there better solutions?
- Store the Lucene indexes on the local file system. Then I can map/reduce over the search results of each worker's index, roughly as sketched below. But this approach requires that every worker node hold an equal amount of data. How can I guarantee that Spark stores an equal amount of data on each worker node?
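What I mean by the second option, as a very rough sketch (reusing fileRDD from the makeRDD example above; the index directory layout, field names, and exact Lucene calls are assumptions on my side and vary by Lucene version):

import java.nio.file.Paths
import org.apache.lucene.analysis.standard.StandardAnalyzer
import org.apache.lucene.index.DirectoryReader
import org.apache.lucene.queryparser.classic.QueryParser
import org.apache.lucene.search.IndexSearcher
import org.apache.lucene.store.FSDirectory

val queryString = "some product name"

// Each task opens the Lucene index stored on its local file system
// (hypothetical layout: /data/lucene/<parquet-name>), searches it, and
// returns its top hits; the driver then merges the per-worker results.
val hits = fileRDD.flatMap { name =>
  val reader = DirectoryReader.open(FSDirectory.open(Paths.get(s"/data/lucene/$name")))
  try {
    val searcher = new IndexSearcher(reader)
    val query = new QueryParser("text", new StandardAnalyzer()).parse(queryString)
    searcher.search(query, 10).scoreDocs.map(sd => (name, searcher.doc(sd.doc).get("id"), sd.score))
  } finally {
    reader.close()
  }
}.collect()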