How do I know when to repartition / coalesce an RDD with unbalanced partitions (ideally without shuffling)?

I download tens of thousands of gzip files from S3 for my Spark job. This leaves some partitions very small (10 records) and some very large (10,000 records). Partition sizes are fairly well distributed across nodes, so each executor seems to work on roughly the same amount of data in aggregate, which is why I'm not sure whether I actually have a problem.

How do I know whether it is worth repartitioning or coalescing the RDD? Can either of them balance the partitions without shuffling data? Also, the RDD will not be reused; it is simply mapped and then joined to another RDD.

+4
1 answer

Interesting question. As for coalesce versus repartition, coalesce will definitely be the better choice here, since it does not cause a full shuffle. In general, coalescing is recommended when you have sparse data per partition (say, after a filter). I think this is a similar scenario, just straight from the initial load. However, whether coalescing is worth it really depends on what you do with the RDD after loading.
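For concreteness, here is a minimal sketch of the two options (the bucket path and the target of 200 partitions are made-up values for illustration): coalesce merges existing partitions without a full shuffle, while repartition balances them evenly at the cost of one.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val sc = new SparkContext(new SparkConf().setAppName("rebalance-sketch"))

// Hypothetical input: tens of thousands of small gzip files on S3. Gzip is
// not splittable, so textFile gives one partition per file here.
val raw = sc.textFile("s3a://my-bucket/events/*.gz")
println(s"partitions after load: ${raw.partitions.length}")

// coalesce(n) merges existing partitions onto fewer tasks; with the default
// shuffle = false it avoids a full shuffle, but partitions can stay unequal.
val merged = raw.coalesce(200)

// repartition(n) is coalesce(n, shuffle = true): records come out evenly
// balanced, but every record is moved across the network.
val rebalanced = raw.repartition(200)

println(s"coalesced: ${merged.partitions.length}, repartitioned: ${rebalanced.partitions.length}")
```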

Whenever you repartition an RDD (or join it to another one), Spark performs a shuffle, and how that shuffle behaves depends on the shuffle manager setting (spark.shuffle.manager). There are two implementations: hash (the default before 1.2.0) and sort (the default from 1.2.0 onwards).

With the hash manager, each map task writes one file per reducer, so a job with many map and reduce tasks produces an enormous number of intermediate files. If you set spark.shuffle.consolidateFiles to true, those outputs are consolidated and far fewer files are created. With tens of thousands of input files that matters, because the sheer number of shuffle files (open file handles, random disk I/O) can become a bottleneck on its own.
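If you do end up shuffling, both settings go on the SparkConf. A minimal sketch follows; the values are illustrative rather than recommendations, and note that spark.shuffle.consolidateFiles only applies to the hash manager and was removed in later Spark releases.

```scala
import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("shuffle-config-sketch")
  // "hash" was the default before Spark 1.2.0, "sort" from 1.2.0 onwards.
  .set("spark.shuffle.manager", "sort")
  // Hash manager only: merge map outputs headed for the same reducer,
  // so far fewer intermediate shuffle files are created.
  .set("spark.shuffle.consolidateFiles", "true")

val sc = new SparkContext(conf)
```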

With the sort manager, each map task writes a single sorted output file (whew!), and each reducer then reads only its own segment of that file. Spark made this the default because it scales much better when there are many partitions. If you are on 1.2.0 or later, this is the behaviour you get out of the box.

In the end, whether to rebalance comes down to what you do with the RDD afterwards. Since yours is only mapped and then joined to another RDD, the join will shuffle the data by key anyway, so a full repartition up front largely duplicates that cost. If the thousands of tiny partitions are creating scheduling overhead or straggler tasks, a coalesce is the cheaper way to even things out.
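To make that decision concrete, a sketch along these lines (the RDD names, key extraction, and 200-partition target are assumptions) first measures per-partition record counts, then coalesces cheaply and lets the join's own shuffle do the real balancing:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

val sc = new SparkContext(new SparkConf().setAppName("decide-rebalance"))

// Hypothetical inputs: the downloaded gzip files keyed for the later join,
// plus the other RDD they are joined to.
val events: RDD[(String, String)] =
  sc.textFile("s3a://my-bucket/events/*.gz")
    .map(line => (line.takeWhile(_ != ','), line))
val lookup: RDD[(String, Int)] = sc.parallelize(Seq(("key1", 1), ("key2", 2)))

// Count records per partition to see how bad the skew actually is.
val sizes = events
  .mapPartitionsWithIndex((i, it) => Iterator((i, it.size)))
  .collect()
sizes.sortBy { case (_, n) => -n }
  .take(5)
  .foreach { case (idx, n) => println(s"partition $idx: $n records") }

// The join shuffles by key anyway, so a full repartition() beforehand mostly
// duplicates that cost; a cheap coalesce() is usually enough if thousands of
// tiny partitions are creating scheduling overhead.
val joined = events.coalesce(200).join(lookup)
joined.count()
```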

+6

Source: https://habr.com/ru/post/1614322/

