Spark subtractByKey increases memory size in RDD cache

I found some very strange behavior with Apache Spark RDDs (Spark 1.6.0 with Scala 2.11):

When I call subtractByKey on an RDD, the resulting RDD should be the same size or smaller. Instead, I get an RDD that takes up even more space in memory:

    // Initialize the first RDD
    val rdd1 = sc.parallelize(Array((1, 1), (2, 2), (3, 3))).cache()
    // Dummy action to cache it => size according to the web UI: 184 Bytes
    rdd1.first
    // Initialize the RDD to subtract (an empty RDD should leave rdd1 unchanged)
    val rdd2 = sc.parallelize(Array[(Int, Int)]())
    // Perform the subtraction
    val rdd3 = rdd1.subtractByKey(rdd2).cache()
    // Dummy action to cache rdd3 => size according to the web UI: 208 Bytes
    rdd3.first
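In case it helps to reproduce this, here is a minimal sketch (assuming the same `sc`, `rdd1` and `rdd3` as above) for reading the cached sizes programmatically instead of from the web UI; `sc.getRDDStorageInfo` is marked as a developer API but is available in Spark 1.6:

    // Print id, name and in-memory size of every cached RDD, so the
    // 184 B vs 208 B difference can be checked without the Storage tab.
    sc.getRDDStorageInfo.foreach { info =>
      println(s"RDD ${info.id} (${info.name}): ${info.memSize} bytes in memory")
    }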

I first observed this strange behavior with an RDD of ~200K rows and about 1.3 GB in size, which grew to more than 2 GB after the subtraction.

Edit: I tried the example above with many more values (10k) => same behavior. The size increases by a factor of about 1.6. reduceByKey also seems to have a similar effect (see the sketch below).
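This is the kind of variant I mean; a hypothetical reproduction of the reduceByKey case, assuming the `rdd1` from the snippet above:

    // The cached result of reduceByKey also reports a larger memory size
    // than rdd1 in the Storage tab, similar to the subtractByKey case.
    val rddReduced = rdd1.reduceByKey(_ + _).cache()
    rddReduced.first   // dummy action to materialize the cache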

When I create an RDD with

    sc.parallelize(rdd3.collect())

the size is the same as for rdd3, so the increased size carries over even when the data is extracted from the RDD and re-parallelized.
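A minimal sketch of that check, assuming the same `sc` and `rdd3` as above (the variable name `rddCopy` is mine):

    // Re-parallelize the collected contents of rdd3 and cache the copy;
    // its reported in-memory size matches rdd3's, not rdd1's.
    val rddCopy = sc.parallelize(rdd3.collect()).cache()
    rddCopy.first   // dummy action to materialize the cache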


Source: https://habr.com/ru/post/1243258/
