I found some very strange behavior with Apache Spark RDDs (Spark 1.6.0 with Scala 2.11):
When I use subtractByKey on an RDD, I would expect the resulting RDD to be of equal or smaller size. Instead, I get an RDD that takes up even more space in memory:
//Initialize first RDD
val rdd1 = sc.parallelize(Array((1,1),(2,2),(3,3))).cache()
//dummy action to cache it => size according to web UI: 184 Bytes
rdd1.first
//Initialize RDD to subtract (empty RDD should result in no change for rdd1)
val rdd2 = sc.parallelize(Array[(Int,Int)]())
//perform subtraction
val rdd3 = rdd1.subtractByKey(rdd2).cache()
//dummy action to cache rdd3 => size according to web UI: 208 Bytes
rdd3.first
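For reference, a minimal sketch of reading the cached sizes programmatically instead of from the web UI (assuming the snippet above has already run in spark-shell; sc.getRDDStorageInfo is a developer API that lists the cached RDDs, and memSize is the number of bytes held in memory):

//print the in-memory size of every cached RDD
sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id}: ${info.memSize} bytes in memory")
}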
I originally encountered this strange behavior with an RDD of ~200k rows and ~1.3 GB in size, which grew to more than 2 GB after the subtraction.
Edit: I tried the example above with many more values (10k) => same behavior. The size increases by about a factor of 1.6. reduceByKey also seems to have a similar effect (see the sketch below).
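A sketch of that reduceByKey variant (assuming a spark-shell with sc; rddA and rddB are just illustrative names, and the keys are unique so no rows are actually merged):

//cache the input and materialize it with a dummy action
val rddA = sc.parallelize((1 to 10000).map(i => (i, i))).cache()
rddA.first
//reduceByKey over the cached input, then cache and materialize the result
val rddB = rddA.reduceByKey(_ + _).cache()
rddB.first
//compare the reported in-memory sizes of rddA and rddB in the Storage tab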
When I create a new RDD with

sc.parallelize(rdd3.collect())

the size is the same as for rdd3, so the increased size carries over even when the data is pulled out of the RDD and re-parallelized.
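A sketch of that round trip (assuming rdd3 from the first snippet is still available; rdd4 is just an illustrative name):

//collect to the driver, re-parallelize, cache, and materialize with a dummy action
val rdd4 = sc.parallelize(rdd3.collect()).cache()
rdd4.first
//the Storage tab then reports the same, larger size for rdd4 as for rdd3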