Interesting question. Regarding coalesce versus repartition, coalesce will generally be the better choice since it does not trigger a full shuffle. Coalescing is usually recommended when your partitions end up sparsely populated (say, after a filter); your case looks similar, only it happens right at the initial load. That said, whether coalescing is worth it really depends on what you do with the RDD after loading it.
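A minimal sketch of the difference, assuming an existing SparkContext sc (as in spark-shell); the input path and partition counts are made up purely for illustration:

    // Hypothetical input path and partition counts, for illustration only;
    // `sc` is assumed to be an existing SparkContext (e.g. the spark-shell one).
    val raw = sc.textFile("hdfs:///data/events", minPartitions = 400)

    // repartition(100) always performs a full shuffle of every record.
    val shuffled = raw.repartition(100)

    // coalesce(100) merges existing partitions locally where possible,
    // so no full shuffle is triggered (shuffle = false is the default).
    val merged = raw.coalesce(100)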
As for the number of files produced when an RDD is shuffled, Spark ships with more than one shuffle implementation, selected via the spark.shuffle.manager property. There are two of them: hash (the default in versions < 1.2.0) and sort (the default in versions >= 1.2.0).
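If you want to pick the manager explicitly (only relevant on the 1.x line), it is just a configuration property; the app name and master below are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder app name and master; the property itself is spark.shuffle.manager.
    val conf = new SparkConf()
      .setAppName("shuffle-demo")
      .setMaster("local[4]")
      .set("spark.shuffle.manager", "sort") // or "hash" on pre-1.2.0-style setups

    val sc = new SparkContext(conf)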
With hash, every map task writes a separate file for every reducer, so the number of shuffle files grows as mappers times reducers and can get out of hand quickly. To mitigate this, you can set spark.shuffle.consolidateFiles to true, which pools output files so that each core, rather than each map task, keeps one file per reducer. Even so, with many cores and many reducers you can still end up with a lot of files and a lot of pressure on the filesystem.
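A rough sketch of that bookkeeping (all cluster figures are invented, only the arithmetic matters):

    // Invented cluster figures, just to show how the file counts scale.
    val mapTasks         = 1000 // M
    val reduceTasks      = 200  // R
    val executors        = 10
    val coresPerExecutor = 8

    // Plain hash shuffle: one file per (mapper, reducer) pair.
    val plainHashFiles = mapTasks * reduceTasks                        // 200,000

    // With spark.shuffle.consolidateFiles = true: one file per (core, reducer) pair.
    val consolidatedFiles = executors * coresPerExecutor * reduceTasks // 16,000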
With sort, each map task produces a single output file (whew!), sorted by reducer id and accompanied by an index, so each reducer fetches only its own region of that file. Spark may still fall back to hash-style per-reducer output when the number of reduce partitions is small, but either way the number of shuffle files stops growing with the number of reducers.
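On 1.x that fallback threshold is the spark.shuffle.sort.bypassMergeThreshold property (default 200); the value below is just an example, not a recommendation:

    // Example only: reducer-count threshold below which the bypass path is used.
    val sortConf = new org.apache.spark.SparkConf()
      .set("spark.shuffle.sort.bypassMergeThreshold", "200")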